TL;DR
What is this device interconnect?
As Almog David pointed out in the comments, this tells you whether one GPU has direct memory access to the other.
What effect does it have on computing power?
Its only effect is on multi-GPU training: data transfer is faster if the two GPUs have a device interconnect.
Why do different GPUs give different results?
It depends on the topology of the hardware setup. A motherboard only has so many PCI-e slots that are connected by the same bus. (Check the topology with nvidia-smi topo -m.)
Can it change over time due to hardware reasons (failures, driver inconsistencies, ...)?
I don't think the order changes over time, unless NVIDIA changes the default enumeration scheme. There is a bit more detail here: https://stackoverflow.com/a/26123645/1097517
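If you want the enumeration to be deterministic and match the PCI bus numbering reported by nvidia-smi, you can pin it yourself before CUDA is initialized. A minimal sketch (CUDA_DEVICE_ORDER and CUDA_VISIBLE_DEVICES are standard CUDA environment variables, not anything TensorFlow-specific; the file name order_test.py is just illustrative):
# order_test.py
import os
# These must be set before the CUDA runtime is initialized (i.e. before any GPU work starts).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # default is FASTEST_FIRST
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"   # optionally restrict/reorder visible devices
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)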
Explanation
This message is generated in the BaseGPUDeviceFactory::CreateDevices function (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L978). It iterates over each pair of devices in the given order and calls cuDeviceCanAccessPeer (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g496bdaae1f632ebfb695b99d2c40f19e). As Almog David mentioned in the comments, this just indicates whether you can perform DMA between the devices.
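If you want to reproduce the N/Y matrix outside of TensorFlow, the sketch below queries cuDeviceCanAccessPeer through the CUDA driver API via ctypes. This is an illustration on my part, not what TensorFlow itself does from Python; the library name libcuda.so.1 and the file name peer_matrix.py are assumptions that may need adjusting for your system (a binding such as pycuda would work just as well):
# peer_matrix.py
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")               # adjust the library name/path if needed
assert cuda.cuInit(0) == 0                       # CUresult cuInit(unsigned int Flags)

count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0

# CUdevice handles are plain ints obtained from cuDeviceGet(&dev, ordinal)
devices = []
for ordinal in range(count.value):
    dev = ctypes.c_int()
    assert cuda.cuDeviceGet(ctypes.byref(dev), ordinal) == 0
    devices.append(dev)

for i, dev_i in enumerate(devices):
    row = []
    for j, dev_j in enumerate(devices):
        if i == j:
            row.append("N")                      # TensorFlow prints N on the diagonal as well
            continue
        can = ctypes.c_int()
        # CUresult cuDeviceCanAccessPeer(int *canAccessPeer, CUdevice dev, CUdevice peerDev)
        assert cuda.cuDeviceCanAccessPeer(ctypes.byref(can), dev_i, dev_j) == 0
        row.append("Y" if can.value else "N")
    print(i, ":", " ".join(row))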
You can run a small test to check whether the order matters. Consider the following snippet:
#test.py
import tensorflow as tf
#allow growth to take up minimal resources
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Now let's check the output for different device orders set via CUDA_VISIBLE_DEVICES:
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
...
2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
...
$ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
...
2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N Y
2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N Y N
2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N Y N N
2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y N N N
...
You can get a more detailed description of the connections by running nvidia-smi topo -m. For example:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SYS     SYS     0-7,16-23
GPU1    PHB      X      SYS     SYS     0-7,16-23
GPU2    SYS     SYS      X      PHB     8-15,24-31
GPU3    SYS     SYS     PHB      X      8-15,24-31
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
Note how this matches the matrices above: GPU pairs that only traverse a PCIe host bridge (PHB), i.e. GPU 0/1 and GPU 2/3, get a Y, while pairs that would have to cross the SMP interconnect between NUMA nodes (SYS) get an N. I believe the further down this legend an entry appears, the faster the transfer (NV# being the fastest, SYS the slowest).
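To see how much this matters in practice, you can time a device-to-device copy between a PHB pair (e.g. GPU 0 and 1 above) and a SYS pair (e.g. GPU 0 and 2) and compare. Below is a rough sketch using the same TF 1.x API as the snippet above; the tensor size, iteration count and device ids are arbitrary choices, and the numbers are only a coarse indication because Session.run adds overhead of its own:
# copy_bench.py
import time
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.device('/gpu:0'):
    src = tf.Variable(tf.random_normal([8192, 8192]))    # ~256 MB of float32
with tf.device('/gpu:1'):                                # try /gpu:2 to cross the SYS link instead
    dst = tf.identity(src)                                # forces a device-to-device copy

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(dst.op)                                      # warm-up; .op avoids fetching the result to the host
    start = time.time()
    for _ in range(20):
        sess.run(dst.op)
    print("avg copy time: %.4f s" % ((time.time() - start) / 20))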