Tensorflow: GPU acceleration only happens after first run

2024-01-03

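I've installed CUDA and CUDNN on my machine (Ubuntu 16.04) alongside tensorflow-gpu.

Versions used: CUDA 10.0, CUDNN 7.6, Python 3.6, Tensorflow 1.14

This is the output from nvidia-smi, showing the video card configuration.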

| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    On   | 00000000:02:00.0 Off |                  N/A |
| N/A   44C    P8    N/A /  N/A |    675MiB /  4046MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1502      G   /usr/lib/xorg/Xorg                           363MiB |
|    0      3281      G   compiz                                        96MiB |
|    0      4375      G   ...uest-channel-token=14359313252217012722    69MiB |
|    0      5157      C   ...felipe/proj/venv/bin/python3.6            141MiB |
+-----------------------------------------------------------------------------+

This is the output from device_lib.list_local_devices() (a tensorflow helper method that shows which devices it can see), showing that my GPU is visible to tensorflow:

[name: "/device:CPU:0"
  device_type: "CPU"
  memory_limit: 268435456
  locality {
  }
  incarnation: 5096693727819965430, 
name: "/device:XLA_GPU:0"
  device_type: "XLA_GPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 13415556283266501672
  physical_device_desc: "device: XLA_GPU device", 
name: "/device:XLA_CPU:0"
  device_type: "XLA_CPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 14339781620792127180
  physical_device_desc: "device: XLA_CPU device", 
name: "/device:GPU:0"
  device_type: "GPU"
  memory_limit: 3464953856
  locality {
    bus_id: 1
    links {
    }
  }
  incarnation: 13743207545082600644
  physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0"
]

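Now for actually using the GPU for computations. I've used a small piece of code to run some dummy matrix multiplications on both the CPU and the GPU, to compare the performance: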

from datetime import datetime

import tensorflow as tf

shapes = [(50, 50), (100, 100), (500, 500), (1000, 1000), (10000, 10000), (15000, 15000)]

devices = ['/device:CPU:0', '/device:XLA_GPU:0']

for device in devices:
    for shape in shapes:
        # Build the ops on the requested device
        with tf.device(device):
            random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
            dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
            sum_operation = tf.reduce_sum(dot_operation)

        # Time the actual runtime of the operations
        start_time = datetime.now()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
            result = session.run(sum_operation)
        elapsed_time = datetime.now() - start_time

        # Print elapsed time, shape and device used
        print("Input shape:", shape, "using Device:", device,
              "took: {:.2f}".format(elapsed_time.total_seconds()))

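Here comes the surprise. The first time I run the cell containing this block of code (I'm on a jupyter notebook), the GPU computations take much longer than the CPU ones: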

# output of first run: CPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.01
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.01
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.01
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.02
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.22
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 21.23
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 2.82
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.17
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.18
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.20
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 28.36
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 93.73
----------------------------------------

Surprise #2: When I rerun the cell containing the dummy matrix multiplication code, the GPU version is much faster (as expected):

# output of reruns: GPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.02
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.02
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.02
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.04
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.78
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 24.65
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.12
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.13
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 1.64
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 5.29
----------------------------------------

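So my question is: Why does GPU acceleration only actually happen after I've run the code once?

I can see the GPU is set up correctly (otherwise no acceleration would happen at all). Is it due to some sort of initial overhead? Do GPUs need to warm up before we can actually use them?

P.S.: On both runs (i.e. the one where the GPU was slower and the later ones where the GPU was faster), I could see GPU usage was at 100%, so it was definitely being used.

P.S.: Only on the very first run does the GPU seem not to get picked up. If I then run the cell two, three or more times, every run after the first one is successful (i.e. the GPU computation is faster).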
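The first-run versus rerun comparison described above can be made explicit inside a single cell by timing the first call separately from later calls. The following is only a minimal sketch: `time_first_and_rest` and `dummy_workload` are hypothetical names introduced here for illustration, and the stand-in workload is plain Python rather than the Tensorflow code from the question.

```python
import time


def time_first_and_rest(fn, repeats=3):
    """Call fn once, then `repeats` more times.

    Returns (first_call_seconds, best_later_call_seconds).
    """
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    rest = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        rest.append(time.perf_counter() - start)
    return first, min(rest)


# Hypothetical stand-in workload; in the notebook this would be a function
# that runs session.run(sum_operation) for one shape and one device.
def dummy_workload():
    return sum(i * i for i in range(100000))


first, best = time_first_and_rest(dummy_workload)
print("first call: %.4f s, best later call: %.4f s" % (first, best))
```

A large gap between the two numbers points to a one-off cost paid on the first call, while similar numbers suggest the slowdown is not a warm-up effect.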


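Robert Crovella's comment (https://stackoverflow.com/questions/56999493/tensorflow-gpu-acceleration-only-happens-after-first-run/57023579#comment100532663_56999493) got me looking into the XLA thing, which helped me find the solution.

It turns out that the GPU is mapped to a Tensorflow device in two ways: as an XLA device and as a normal GPU.

This is why there were two devices, one named "/device:XLA_GPU:0" and the other "/device:GPU:0".

All I needed to do was to use "/device:GPU:0" instead. Now the GPU gets picked up by Tensorflow immediately.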
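As a rough illustration of that change, the sketch below filters the XLA entries out of the device names shown in the question, leaving the plain CPU and GPU devices. The helper name `plain_devices` is an assumption made for this example and is not part of Tensorflow.

```python
def plain_devices(device_names):
    """Return the device names that are not XLA devices, preserving order."""
    return [name for name in device_names if "XLA_" not in name]


# Names copied from the list_local_devices() output in the question.
visible = [
    "/device:CPU:0",
    "/device:XLA_GPU:0",
    "/device:XLA_CPU:0",
    "/device:GPU:0",
]

devices = plain_devices(visible)
print(devices)  # ['/device:CPU:0', '/device:GPU:0']
```

In the benchmark loop, this list would take the place of the original `devices = ['/device:CPU:0', '/device:XLA_GPU:0']`, so that `tf.device(device)` targets "/device:GPU:0". In a live session the names could instead be collected with `[d.name for d in device_lib.list_local_devices()]`.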
