Paper Reading Notes (47): Attention Is All You Need

2023-11-13

Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

[Figure 1: The Transformer model architecture, with the encoder on the left and the decoder on the right.]

3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
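The sub-layer wrapper LayerNorm(x + Sublayer(x)) can be sketched for a single position vector as follows. This is a minimal illustration, not the authors' implementation: the names `layer_norm` and `residual_sublayer` are hypothetical, the learned gain and bias of layer normalization are omitted, and the doubling sub-layer is a stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer for illustration only: doubles every component.
out = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * a for a in v])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```

Because the residual sum requires x and Sublayer(x) to have the same length, every sub-layer has to keep the same output dimension, which is why the text fixes it to a single value throughout the model.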

Decoder: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

3.2.1 Scaled Dot-Product Attention
We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
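The computation described above (dot products of each query with all keys, division by the square root of the key dimension, softmax, then a weighted sum of the values) can be sketched in plain Python. This is an illustrative sketch using lists of lists rather than optimized matrix code; the function names and the toy matrices are assumptions made for the example.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

# Toy inputs: each query matches exactly one key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
```

Each output row leans toward the value whose key best matches the query, but still mixes in the other value because softmax never assigns an exact zero weight.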

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
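A small simulation can make the scaling argument concrete. It assumes, as an idealization that is not stated in these notes, that the components of a query and a key are independent with mean 0 and variance 1; under that assumption the unscaled dot product has variance close to the key dimension, and scaling brings it back to about 1. The function name and the trial count are arbitrary choices for this sketch.

```python
import math
import random

def dot_product_variance(d_k, scale, trials=2000, seed=0):
    """Empirical variance of scale * (q . k) for random unit-variance q and k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

d_k = 64
raw_var = dot_product_variance(d_k, scale=1.0)                      # roughly d_k
scaled_var = dot_product_variance(d_k, scale=1.0 / math.sqrt(d_k))  # roughly 1
```

With a variance near 1 the softmax inputs stay in a range where several keys can receive non-negligible weight, which is the regime in which its gradients are not vanishingly small.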

3.2.2 Multi-Head Attention
Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

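$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$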

Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{\text{model}}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
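The project, attend, concatenate and project-again pipeline can be sketched as below. The dimensions are deliberately tiny (a model dimension of 8 with 2 heads instead of 512 with 8), the projection matrices are random and untrained, and all function names are hypothetical; the sketch only shows how the shapes fit together.

```python
import math
import random

def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix (lists of lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the heads along the feature axis, then apply the output projection.
    concat = [[v for head in head_outputs for v in head[i]] for i in range(len(Q))]
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, h = 8, 2              # toy sizes, not the paper's 512 and 8
d_k = d_v = d_model // h
heads = [(random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_v, rng)) for _ in range(h)]
W_o = random_matrix(h * d_v, d_model, rng)
X = random_matrix(3, d_model, rng)   # 3 positions used as queries, keys and values
Y = multi_head_attention(X, X, X, heads, W_o)
```

Since each head works in a subspace of size d_model / h, the h heads together cost about as much as one full-width head, which matches the cost remark in the text.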

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:
• In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
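The masking used in decoder self-attention can be sketched as follows: disallowed scores are set to negative infinity before the softmax, so their weights come out as exactly zero. The helper names are hypothetical, and the all-zero score matrix is chosen only so the resulting weights are easy to read.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    out = []
    for row, allowed in zip(scores, mask):
        # Illegal connections are set to -inf before the softmax.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because every position may attend to itself
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]   # uniform scores for illustration
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, position 0 puts all of its weight on itself, position 1 splits it evenly over positions 0 and 1, and so on; no position ever receives weight from a later one.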

3.3 Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
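A position-wise feed-forward network (two linear transformations with a ReLU in between, applied to every position with the same weights) can be sketched with toy sizes. The weight values below are made up for illustration, with an input width of 2 and an inner width of 3 in place of 512 and 2048, and the function names are hypothetical.

```python
def linear(x, W, b):
    """Compute xW + b for a vector x and a d_in x d_out matrix W."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj for col, bj in zip(zip(*W), b)]

def feed_forward(x, W1, b1, W2, b2):
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

def position_wise_ffn(X, W1, b1, W2, b2):
    # The same parameters are applied to each position independently.
    return [feed_forward(x, W1, b1, W2, b2) for x in X]

# Made-up toy parameters.
W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

X = [[1.0, 2.0], [-1.0, 0.0], [1.0, 2.0]]   # positions 0 and 2 hold the same vector
out = position_wise_ffn(X, W1, b1, W2, b2)
```

Positions 0 and 2 carry identical inputs and therefore get identical outputs, which illustrates that the network does not mix information across positions.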

3.4 Embeddings and Softmax
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.

3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

In this work, we use sine and cosine functions of different frequencies:
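$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$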

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
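The sinusoidal encoding and the claim about fixed offsets can be checked numerically. The sketch below assumes an even model dimension and interleaves sine and cosine components; `shift` is a hypothetical helper that applies, to each sine/cosine pair, the rotation given by the angle-addition identities, which depends only on the offset and not on the position.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; assumes d_model is even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) using a rotation that depends only on k."""
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(theta) + c * math.sin(theta),   # sin(a + theta)
                    c * math.cos(theta) - s * math.sin(theta)])  # cos(a + theta)
    return out

pe_0 = positional_encoding(0, 8)
pe_5 = positional_encoding(5, 8)
pe_8 = positional_encoding(8, 8)
pe_8_via_shift = shift(pe_5, 3, 8)
```

The rotation is a linear map whose coefficients are fixed by the offset alone, so the same map carries any position to the one three steps later; this is the property the authors hypothesize makes relative-position attention easy to learn.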

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

4 Why Self-Attention
In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, \dots, x_n)$ to another sequence of equal length $(z_1, \dots, z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention, we consider three desiderata.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translation, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.
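The crossover at n = d can be illustrated with rough operation counts. The sketch assumes the per-layer costs that the paper lists in its Table 1 (which is not reproduced in these notes): on the order of n²·d for self-attention, n·d² for a recurrent layer, and r·n·d for self-attention restricted to a neighborhood of size r. Constant factors are ignored, so the numbers are only order-of-magnitude guides, and the function names are hypothetical.

```python
def self_attention_ops(n, d):
    return n * n * d      # every position attends to every position

def recurrent_ops(n, d):
    return n * d * d      # one d x d transition per position

def restricted_attention_ops(n, d, r):
    return r * n * d      # each position attends to r neighbors only

d = 512
short = (self_attention_ops(50, d), recurrent_ops(50, d))        # typical sentence length
long_seq = (self_attention_ops(1000, d), recurrent_ops(1000, d))  # n larger than d
restricted = restricted_attention_ops(1000, d, r=64)
```

For a sentence of 50 tokens, self-attention needs about a tenth of the recurrent layer's operations, while at 1000 tokens the relationship flips; restricting attention to 64 neighbors brings the long-sequence cost back below both.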

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$. Even with $k = n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
