Distributed Deep Learning Techniques: AllReduce

2023-11-04


Hello, I am Yuichiro Ueno. I participated in a summer internship program at PFN in 2017, and I currently work as a part-time engineer. I am an undergraduate student at Tokyo Institute of Technology, and my research topic is High-Performance, Parallel and Distributed Computing.

In this blog post, I will describe our recent study on algorithms for AllReduce, a communication operation used for distributed deep learning.

What is Distributed Deep Learning?

Currently, one of the significant challenges of deep learning is that it is a very time-consuming process. Designing a deep learning model requires design space exploration of a large number of hyper-parameters and processing big data. Thus, accelerating the training process is critical for our research and development. Distributed deep learning is one of the essential technologies in reducing training time.

We have deployed a private supercomputer “MN-1” to accelerate our research and development process. It is equipped with 1024 NVIDIA® Tesla® P100 GPUs and Mellanox® InfiniBand FDR interconnect and is the most powerful supercomputer in the industry segment in Japan. By leveraging MN-1, we completed training a ResNet-50 model on the ImageNet dataset in 15 minutes.

Communication among GPUs is one of the many challenges when training distributed deep learning models in a large-scale environment. The latency of exchanging gradients over all GPUs is a severe bottleneck in data-parallel synchronized distributed deep learning.
(For an explanation of what data-parallel synchronized distributed deep learning is, you can refer to this article.)

How is the communication performed in distributed deep learning? Also, why is the communication so time-consuming?

The Importance of AllReduce in Distributed Deep Learning

In synchronized data-parallel distributed deep learning, the major computation steps are:

  1. Compute the gradient of the loss function using a minibatch on each GPU.
  2. Compute the mean of the gradients by inter-GPU communication.
  3. Update the model.

To compute the mean in step 2, we use a collective communication operation called “AllReduce.”
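As a small illustration of step 2 (plain Python with made-up function and variable names, not code from ChainerMN or NCCL), averaging the gradients amounts to an elementwise SUM-AllReduce followed by a division by the number of workers:

```python
# Toy illustration of step 2: averaging gradients across GPUs is an
# elementwise SUM-AllReduce followed by a division by the worker count.
# Names and data are illustrative only.

def allreduce_mean(local_gradients):
    """local_gradients[p] is the gradient vector computed on GPU p."""
    P = len(local_gradients)
    summed = [sum(column) for column in zip(*local_gradients)]  # SUM across GPUs
    return [s / P for s in summed]                              # same mean on every GPU

grads = [[0.2, -0.4], [0.6, 0.0], [0.1, 0.8], [0.3, -0.2]]      # 4 GPUs, 2 parameters
print(allreduce_mean(grads))                                    # ≈ [0.3, 0.05]
```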

As of now, one of the fastest collective communication libraries for GPU clusters is the NVIDIA Collective Communications Library (NCCL) [3]. It achieves far better communication performance than MPI, which is the de-facto standard communication library in the HPC community. NCCL is indispensable for achieving high performance in distributed deep learning using ChainerMN. Without it, the ImageNet 15-min feat could not have been achieved [2].

Our researchers and engineers were curious about NCCL’s excellent performance. Since NCCL is not an open source library, we tried to understand the high performance of the library by developing and optimizing an experimental AllReduce library.

Algorithms of AllReduce

First, let’s take a look at the AllReduce algorithms.

AllReduce is an operation that reduces the target arrays in all processes to a single array and returns the resultant array to all processes.

(Here “All” means that every process receives the result and “Reduce” means that the arrays are combined into one. For example, the gradients held on all GPUs, each represented as an array, are reduced into a single array, and the result is returned to every GPU.)

Below, the execution flow of AllReduce is illustrated using four processes as an example.

Now, let $P$ be the total number of processes. Each process $p$ $(1 \le p \le P)$ has an array of length $N$ called $A_p$, whose $i$-th element is $A_{p,i}$.

The resulting array B is to be:

$$B_i = A_{1,i} \; Op \; A_{2,i} \; Op \; \cdots \; Op \; A_{P,i}$$
Here, Op is a binary operator. SUM, MAX, and MIN are frequently used. In distributed deep learning, the SUM operation is used to compute the mean of gradients. In the rest of this blog post, we assume that the reduction operation is SUM. Figure 1 illustrates how the AllReduce operation works by using an example of P=4 and N=4.

Fig.1 AllReduce Operation
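As a quick sanity check of this definition, the following sketch (illustrative Python with arbitrary example values) computes $B$ for a $P = 4$, $N = 4$ case like Fig. 1, with Op set to SUM or MAX:

```python
# Elementwise reduction across P arrays with a configurable binary operator Op,
# matching the definition of B above. Values are arbitrary example data.
from functools import reduce
import operator

def allreduce_result(arrays, op=operator.add):
    # zip(*arrays) yields the i-th element of every process's array at once.
    return [reduce(op, column) for column in zip(*arrays)]

A = [[1, 2, 3, 4],       # A_1
     [5, 6, 7, 8],       # A_2
     [9, 10, 11, 12],    # A_3
     [13, 14, 15, 16]]   # A_4
print(allreduce_result(A))          # Op = SUM: [28, 32, 36, 40]
print(allreduce_result(A, op=max))  # Op = MAX: [13, 14, 15, 16]
```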

There are several algorithms to implement the operation. For example,

a straightforward one is to select one process as a master, gather all arrays into the master, perform reduction operations locally in the master, and then distribute the resulting array to the rest of the processes.
(In GPU terms, one GPU collects the gradient arrays from all other GPUs, reduces them on the master, and distributes the result back to every GPU; each array element might be, for example, the gradient of one parameter.)

Although this algorithm is simple and easy to implement, it is not scalable. The master process is a performance bottleneck because its communication and reduction costs increase in proportion to the total number of processes.
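The sketch below (again illustrative Python, simulating communication with plain lists) mimics this gather-reduce-broadcast structure and counts the bytes the master would have to receive and send, which makes the bottleneck explicit:

```python
# Sketch of the simple master-based AllReduce (gather, reduce, broadcast),
# simulated with lists. The byte counter shows that the master's traffic
# grows linearly with the number of processes P.

def master_allreduce(arrays, elem_bytes=4):
    P, N = len(arrays), len(arrays[0])
    master_traffic = (P - 1) * N * elem_bytes        # gather: master receives P-1 arrays
    reduced = [sum(col) for col in zip(*arrays)]     # local reduction on the master
    master_traffic += (P - 1) * N * elem_bytes       # broadcast: master sends P-1 arrays
    return [list(reduced) for _ in range(P)], master_traffic

result, traffic = master_allreduce([[1, 2], [3, 4], [5, 6], [7, 8]])
print(result[0], traffic)   # [16, 20] 48  (2 * (P-1) * N * 4 bytes)
```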

Faster and more scalable algorithms have been proposed. They eliminate the bottleneck by carefully distributing the computation and communication over the participant processes.
Such algorithms include Ring-AllReduce and Rabenseifner’s algorithm[4].

We will focus on the Ring-AllReduce algorithm in this blog post. This algorithm is also employed by NCCL [5] and baidu-allreduce [6].

Ring-AllReduce

In this algorithm, the P processes form a logical ring, and each process communicates only with its two neighbors (Fig. 2).

Fig.2 Example of a process ring

First, each process divides its own array into P subarrays, which we refer to as “chunks”. Let chunk[p] be the p-th chunk.

Next, let us focus on the process [p]. The process sends chunk[p] to the next process, while it receives chunk[p-1] from the previous process simultaneously (Fig.3).
(In Fig. 3, for example, process 2 sends chunk[2] to process 3 while receiving chunk[1] from process 1.)

Fig.3 Each process sends its chunk[p] to the next process [p+1]

Then, process p performs the reduction operation on the received chunk[p-1] and its own chunk[p-1], and sends the reduced chunk to the next process p+1 (Fig.4).
(As shown in Fig. 4, process 2 reduces, e.g. sums, the chunk[1] received from process 1 with its own chunk[1] and then sends the result to process 3.)

Fig.4 Each process sends a reduced chunk to the next process

By repeating the receive-reduce-send steps P-1 times, each process obtains a different portion of the resulting array (Fig.5).
(Here, each portion is the reduced value of one parameter, assuming four parameters in total.)

Fig.5 After P-1 steps, each process has a reduced subarray.

In other words, each process adds its local chunk to the received chunk and sends the result to the next process. Viewed from a chunk’s perspective, every chunk travels all the way around the ring, accumulating the corresponding chunk of every process it passes. After visiting all processes once, it becomes a portion of the final result array, and the last-visited process holds it.

Finally, all processes can obtain the complete array by sharing the distributed partial results among them. This is achieved by doing the circulating step again without reduction operations, i.e., merely overwriting the received chunk to the corresponding local chunk in each process. The AllReduce operation completes when all processes obtain all portions of the final array.
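Putting the reduce-scatter half and the allgather half together, here is a minimal single-process Python simulation of Ring-AllReduce with a SUM reduction (illustrative only; “sending” a chunk is modeled as reading it from the predecessor’s buffer, and the chunk indexing follows the description above):

```python
# Single-process simulation of Ring-AllReduce with a SUM reduction.
# "Sending" chunk c to the next process is modeled by letting process p read
# chunk c from its predecessor's buffer; no real communication is involved.

def ring_allreduce(arrays):
    P = len(arrays)                   # number of processes
    N = len(arrays[0])                # array length, assumed divisible by P
    size = N // P                     # chunk length
    data = [list(a) for a in arrays]  # data[p] is the local buffer of process p

    def sl(c):                        # slice covering chunk c
        return slice(c * size, (c + 1) * size)

    # Phase 1: reduce-scatter. At step t, process p receives chunk (p-t-1) mod P
    # from its predecessor and adds it to its own copy. After P-1 steps,
    # process p holds the fully reduced chunk (p+1) mod P.
    for t in range(P - 1):
        for p in range(P):
            c, prev = (p - t - 1) % P, (p - 1) % P
            data[p][sl(c)] = [x + y for x, y in zip(data[p][sl(c)], data[prev][sl(c)])]

    # Phase 2: allgather. At step t, process p receives the already-final chunk
    # (p-t) mod P from its predecessor and simply overwrites its local copy.
    for t in range(P - 1):
        for p in range(P):
            c, prev = (p - t) % P, (p - 1) % P
            data[p][sl(c)] = list(data[prev][sl(c)])

    return data

# Same P = 4, N = 4 shape as the figures above; values are arbitrary.
arrays = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
print(ring_allreduce(arrays)[0])   # every process ends with [28, 32, 36, 40]
```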

Let’s compare the amount of communication of Ring-AllReduce to that of the simple algorithm we mentioned above.

In the simple algorithm, the master process receives all the arrays from all other processes, which means the total amount of received data is $(P-1) \times N$. After the reduction operation, it sends the arrays back to all the processes, which is again $(P-1) \times N$ data. Thus, the amount of communication of the master process is proportional to $P$.

In the Ring-AllReduce algorithm, we can calculate the amount of communication in each process in the following way. In the earlier half of the algorithm, each process sends an array of size $N/P$ a total of $P-1$ times. Next, each process again sends an array of the same size $P-1$ times. The total amount of data each process sends throughout the algorithm is $2N(P-1)/P$, which is practically independent of $P$.
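As a concrete illustration using the $P = 8$ processes and 256 MB arrays of the evaluation below: the master of the simple algorithm receives $(P-1) \times N = 7 \times 256\,\mathrm{MB} = 1792\,\mathrm{MB}$ and then sends the same amount again, whereas each process in Ring-AllReduce sends only $2N(P-1)/P = 2 \times 256\,\mathrm{MB} \times 7/8 = 448\,\mathrm{MB}$, a figure that barely changes as more processes are added.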

Thus, the Ring-AllReduce algorithm is more efficient than the simple algorithm because it eliminates the bottleneck process by distributing computation and communication evenly over all participant processes. Many AllReduce implementations adopt Ring-AllReduce, and it is suitable for distributed deep learning workloads as well.

Implementation and Optimization

The Ring-AllReduce algorithm is simple to implement if basic send and receive routines are given. baidu-allreduce[6] is built on top of MPI using MPI_Send and MPI_Recv.

However, we tried to do further optimizations by using InfiniBand Verbs API instead of MPI. To fully utilize hardware resources, the algorithm has multiple stages such as memory registration (pinning), cuda-memcpy, send, reduction, receive, and memory deregistration, and they are processed in a software pipeline. Here, “registration” and “deregistration” are pre- and post-processing stages for DMA data transfer. Such low-level operations are abstracted out in MPI send/receive routines, and we are not able to split them into pipeline stages. To increase the granularity of the communication and computation, we further divide chunks into smaller sub-chunks. Also, we introduce a memory pool to hide memory allocation overhead.
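The toy below is a rough, assumed sketch of the sub-chunk pipelining idea only; it is not the actual implementation, and the real stages (pinning, cudaMemcpy, RDMA send/receive, reduction, deregistration) are replaced with placeholders. It runs a “receive” stage and a “reduce” stage concurrently, and splitting a chunk into sub-chunks is what lets reducing sub-chunk i overlap with receiving sub-chunk i+1:

```python
# Toy software pipeline over sub-chunks: a "receive" stage and a "reduce" stage
# run concurrently, so reducing sub-chunk i can overlap with receiving sub-chunk
# i+1. All InfiniBand/CUDA specifics are omitted; data is made up.
import threading
import queue

NUM_SUBCHUNKS = 4
SUBCHUNK_LEN = 4

recv_queue = queue.Queue()

def receiver():
    # Stage 1: pretend to receive sub-chunks from the previous process.
    for i in range(NUM_SUBCHUNKS):
        subchunk = [float(i)] * SUBCHUNK_LEN     # placeholder "network" data
        recv_queue.put((i, subchunk))
    recv_queue.put(None)                         # end-of-stream marker

def reducer(local_chunk):
    # Stage 2: fold each received sub-chunk into the local buffer while later
    # sub-chunks are still in flight.
    while True:
        item = recv_queue.get()
        if item is None:
            break
        i, subchunk = item
        base = i * SUBCHUNK_LEN
        for j, value in enumerate(subchunk):
            local_chunk[base + j] += value

local_chunk = [1.0] * (NUM_SUBCHUNKS * SUBCHUNK_LEN)
threads = [threading.Thread(target=receiver),
           threading.Thread(target=reducer, args=(local_chunk,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(local_chunk)   # [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, ...]
```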

Performance Evaluation

For performance evaluation, we compared our prototype (called PFN-Proto) to several AllReduce implementations shown in the Appendix.

Our prototype implementation currently focuses on inter-node communication; it is not optimized for intra-node communication using shared memory or GPU-to-GPU DMA data transfer. We evaluated the implementations in a one-process-per-node configuration. For Open MPI [7], our company has yet to introduce the latest 3.x series because the most recent series has a minor issue related to GPUDirect, so we used version 2.1.3 instead.

We used our private supercomputer MN-1 for this experiment, as shown in the “Experimental environment” below. Eight processes were run, one per computing node. The target data size is 256MB.

Fig.6 AllReduce Execution Time

Figure 6 shows the result of the evaluation. Each bar indicates the median of 10 runs, and the error bars indicate confidence intervals. The details of each library are shown in the “software versions” below.

First, let’s look at the median values. Our experimental implementation, PFN-Proto, showed the fastest time, which is approximately 82%, 286%, 28%, 1.6% better than ompi, ompi-cuda, Baidu, NCCL, respectively. One thing worth mentioning, which is not in the graph, is that Baidu achieved the fastest single-run time, 0.097 s, among all five libraries.

Next, we focus on the variance of the performance. The maximum and minimum runtimes of PFN-Proto and NCCL are within +/- 3% and +/- 6% of the median, respectively. In contrast, Baidu’s maximum value is 7.5x its median, because its first run takes a very long time. Its maximum runtime excluding the first run is +9.6% over the median, which is still larger than the variation of NCCL and PFN-Proto.

Our hypothesis is that the performance variances of MPI and MPI-based routines are attributed to MPI’s internal behavior related to memory operations. MPI’s programming interface hides memory allocation and registration operations for InfiniBand communication. Timings of such operations are not controllable from those AllReduce implementations.

Summary

We described the AllReduce communication pattern, which is very important for distributed deep learning. In particular, we implemented the Ring-AllReduce algorithm in our experimental communication library, and it achieved performance comparable to the NCCL library released by NVIDIA. The implementation efficiently utilizes available hardware resources through advanced optimizations such as using the InfiniBand Verbs API and software pipelining. We continue our research and development on accelerating distributed deep learning.

Caveats: our implementation is experimental, and we only demonstrated the performance on our in-house cluster. NCCL is a highly practical and usable library thanks to its performance suitability and availability on a wide range of IB-connected NVIDIA GPU clusters.

Acknowledgement

I would like to thank my mentors and the team for their kind support and feedback. Since my internship period last year, I have been given access to rich computation resources, and it has been a fantastic experience.

From Mentors:

This project started with a question: “how does NCCL achieve such high and stable performance?” It is an advanced and experimental topic, but Mr. Ueno achieved a remarkable result with his high motivation and technical skills.

PFN is looking for talent, not only in the deep learning/machine learning field but across a full range of technical areas from hardware to software. Please visit https://www.preferred-networks.jp/en/jobs for more information.

For students who are interested in high-performance computing and other technologies, PFN offers international internship opportunities, as well as domestic programs for Japanese students. The application period has finished this year, but be ready for the next opportunity!

References

[1] Preferred Networks officially released ChainerMN version 1.0.0
[2] Akiba, et al., “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”
[3] NVIDIA Collective Communications Library
[4] Rabenseifner, “Optimization of Collective Reduction Operations”, ICCS 2004
[5] Jeaugey, “Optimized Inter-GPU Collective Operations with NCCL”, GTC 2017
[6] baidu-allreduce
[7] Open MPI
[8] New ChainerMN functions for improved performance in cloud environments and performance testing results on AWS
[9] Tsuzuku, et al., “Variance-based Gradient Compression for Efficient Distributed Deep Learning”, In Proceedings of ICLR 2018 (Workshop Track)

Original article: Technologies behind Distributed Deep Learning: AllReduce

