Ubuntu 16.04 + NVIDIA RTX 3090 + PyTorch + TensorFlow

2023-05-16

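Contents

  • Overview
  • Useful links
  • Installing the GPU driver
    • Downloading the files
    • Installing the display driver and CUDA toolkit in one pass
    • Installing only the display driver
    • Installing only the CUDA toolkit
  • Installing PyTorch
    • Installing PyTorch 1.7
    • Building PyTorch 1.8 from source
    • Building torchvision from source
  • RTX 3090 performance issues
    • Deep learning
    • Results on PyTorch
      • Different convolution types
      • MNIST classification
    • Results on TensorFlow
      • CIFAR image classification
    • Things to watch out for
      • TensorFloat-32
  • Odd behavior
  • Performance of different PyTorch versions on different devices
    • Results for 3090 + CUDA 11
    • 1080 Ti + CUDA 11
    • 1080 Ti + CUDA 10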

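Overview

This post records how I installed the display driver and the CUDA compute toolkit for an RTX 3090. All installers used here are in .run format.

Useful links

  • PyTorch binary package downloads

Installing the GPU driver

Downloading the files

  • CUDA: download the installer cuda_11.1.0_455.23.05_linux.run from here
  • Driver: download the installer NVIDIA-Linux-x86_64-455.45.01.run or NVIDIA-Linux-x86_64-455.23.04.run from here

Installing the display driver and CUDA toolkit in one pass

Follow the steps in "Ubuntu 16.04 LTS + CUDA8.0 + cudnn6.0".

After installation, reboot, change to the directory /usr/local/cuda-11.1/samples/1_Utilities/deviceQuery, run sudo make, and then run ./deviceQuery to print information about the devices and the driver.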

/usr/local/cuda-11.1/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce RTX 3090"
  CUDA Driver Version / Runtime Version          11.1 / 11.1
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24265 MBytes (25443893248 bytes)
  (82) Multiprocessors, (128) CUDA Cores/MP:     10496 CUDA Cores
  GPU Max Clock rate:                            1785 MHz (1.78 GHz)
  Memory Clock rate:                             9751 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.1 / 11.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce RTX 3090 (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce RTX 3090 (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 11.1, NumDevs = 2
Result = PASS
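When several machines need checking, the listing above can be summarized with a short script. The following is a minimal sketch, not part of the CUDA samples: the function name parse_device_query is made up for illustration, and it assumes the output keeps the line layout shown above.

```python
import re

def parse_device_query(text):
    """Extract the name and compute capability of each device from deviceQuery output."""
    devices = []
    for line in text.splitlines():
        name = re.match(r'\s*Device \d+: "(.+)"', line)
        if name:
            devices.append({"name": name.group(1), "capability": None})
            continue
        cap = re.match(r"\s*CUDA Capability Major/Minor version number:\s+(\d+)\.(\d+)", line)
        if cap and devices:
            # Attach the capability to the most recently seen device.
            devices[-1]["capability"] = (int(cap.group(1)), int(cap.group(2)))
    return devices

# Shortened copy of the output above.
sample = '''Device 0: "GeForce RTX 3090"
  CUDA Capability Major/Minor version number:    8.6
Device 1: "GeForce GTX 1080 Ti"
  CUDA Capability Major/Minor version number:    6.1
'''

for dev in parse_device_query(sample):
    print(dev["name"], dev["capability"])
```

A capability of (8, 6) identifies the Ampere card and (6, 1) the Pascal card, which is a quick way to confirm that both GPUs are visible to the CUDA runtime.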

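Tip: if no driver is installed yet, or the installed one does not support the 3090, there are two options. One is to install the driver before fitting the 3090. The other is to fit the 3090 first and install afterwards, but in that case a driver error dialog appears at boot.

When that happens, press Ctrl+Alt+F2 to switch to a text console. Do not press Ctrl+Alt+F1: with it, sudo service lightdm stop may fail to shut the X server down completely, and installing the display driver then fails with the following error: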

[INFO]: Initializing menu
[INFO]: Silent install option: skipping toolkit
[INFO]: Silent install option: skipping samples
[INFO]: Silent install option: skipping toolkit
[INFO]: Silent install option: skipping toolkit
[INFO]: Components to install: 
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd  2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting

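Installing only the display driver

The driver can be installed from the CUDA toolkit installer (e.g. cuda_11.1.0_455.23.05_linux.run) or from a standalone driver installer (e.g. NVIDIA-Linux-x86_64-455.45.01.run). If the X server is still running, the installer reports: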

 ERROR: You appear to be running an X server; please exit X before installing.  For 
         further details, please see the section INSTALLING THE NVIDIA DRIVER in the 
         README available on the Linux driver download page at www.nvidia.com.     

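This message means the X server was not shut down completely, and ps aux | grep X does indeed show the related processes. If none of sudo service lightdm stop, sudo /etc/init.d/lightdm stop, or sudo /etc/init.d/gdm stop (for a gdm desktop) stops the X server completely, reboot and, just before the desktop loads and the driver error dialog pops up, press Ctrl+Alt+F2 to switch to a text console (not Ctrl+Alt+F1). Then run the installer again.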
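The ps aux | grep X check also matches the grep process itself, which is easy to misread. Below is a minimal sketch of a stricter check. The function name find_x_server_processes and the sample process list are invented for illustration, and the code assumes the standard 11-column ps aux layout.

```python
def find_x_server_processes(ps_output):
    """Return (pid, command) pairs for `ps aux` lines whose executable is X or Xorg."""
    found = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        fields = line.split(None, 10)            # 11 columns; the last one is the full command
        if len(fields) < 11:
            continue
        command = fields[10]
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable in ("X", "Xorg"):
            found.append((int(fields[1]), command))
    return found

# Illustrative input only; on a real machine pass the output of `ps aux`, e.g.
#   subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
sample = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1203  1.2  0.5 412345 90123 tty7     Ssl+ 22:41   0:03 /usr/lib/xorg/Xorg -core :0 -seat seat0 -nolisten tcp vt7
user      2210  0.0  0.0  14224   980 pts/0    S+   22:45   0:00 grep --color=auto X
"""

for pid, command in find_x_server_processes(sample):
    print(pid, command)
```

An empty result suggests it is safe to start the driver installer. A non-empty one lists the PIDs that are still holding the display.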

Installing only the CUDA toolkit

Follow the steps in "Ubuntu 16.04 LTS + CUDA8.0 + cudnn6.0", selecting only the CUDA Library component during installation.

Installing PyTorch

Installing PyTorch 1.7

Note: as of 2021-01-06 the conda repositories have no PyTorch build for CUDA 11.1, so passing cudatoolkit=11.1 to the command below downloads the CPU-only build. Use cudatoolkit=11.0 instead.

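conda create -n rtx3090    # create a new environment named rtx3090
conda activate rtx3090     # switch to it so the packages are installed there
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

The command prints the following: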

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ca-certificates-2020.12.5  |       ha878542_0         137 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    certifi-2020.12.5          |   py36h5fab9bb_0         143 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    cudatoolkit-11.0.3         |       h15472ef_6       952.1 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    dataclasses-0.7            |     pyhe4b4509_6          21 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    freetype-2.8.1             |       hfa320df_1         789 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ld_impl_linux-64-2.35.1    |       hea4e1c9_1         617 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libffi-3.3                 |       h58526e2_2          51 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libgcc-ng-9.3.0            |      h5dbcf3e_17         7.8 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libpng-1.6.37              |       h21135ba_2         306 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libstdcxx-ng-9.3.0         |      h2ae2ef3_17         4.0 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libtiff-4.0.9              |       he6b73bb_1         521 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libuv-1.40.0               |       h7f98852_0         1.0 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    llvm-openmp-11.0.0         |       hfc4b9b4_1         2.8 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl-2020.4                 |     h726a3e6_304       215.6 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl-service-2.3.0          |   py36h8c4c3a4_2          54 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl_fft-1.2.0              |   py36h68bb277_1         164 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl_random-1.2.0           |   py36h7c3b610_1         314 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ninja-1.10.2               |       h4bd325d_0         2.4 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    numpy-1.19.2               |   py36h54aff64_0          21 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    numpy-base-1.19.2          |   py36hfa32c7d_0         5.2 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    olefile-0.46               |     pyh9f0ad1d_1          32 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    openssl-1.1.1i             |       h7f98852_0         2.1 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    pillow-5.2.0               |           py36_0        1007 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    pip-20.3.3                 |     pyhd8ed1ab_0         1.1 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    python-3.6.12              |hffdb5ce_0_cpython        38.4 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    pytorch-1.7.1              |py3.6_cuda11.0.221_cudnn8.0.5_0       770.6 MB  pytorch
    setuptools-49.6.0          |   py36h9880bd3_2         947 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    sqlite-3.34.0              |       h74cdb3f_0         1.4 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    tk-8.6.10                  |       h21135ba_1         3.2 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    torchaudio-0.7.2           |             py36         9.8 MB  pytorch
    torchvision-0.8.2          |       py36_cu110        17.9 MB  pytorch
    typing_extensions-3.7.4.3  |             py_0          25 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    wheel-0.36.2               |     pyhd3deb0d_0          31 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    xz-5.2.5                   |       h516909a_1         343 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
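The build string in the pytorch row above is the quickest way to tell whether conda picked a CUDA build or fell back to the CPU build. Here is a minimal sketch of that check. The function name describe_pytorch_build is made up, and the CPU build string in the second call is an assumed example of the naming pattern rather than a value taken from the listing.

```python
import re

def describe_pytorch_build(build):
    """Classify a conda build string of the pytorch package as a CPU or CUDA build."""
    cuda = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    cudnn = re.search(r"cudnn(\d+(?:\.\d+)*)", build)
    if cuda is None:
        return {"kind": "cpu", "cuda": None, "cudnn": None}
    return {
        "kind": "cuda",
        "cuda": cuda.group(1),
        "cudnn": cudnn.group(1) if cudnn else None,
    }

print(describe_pytorch_build("py3.6_cuda11.0.221_cudnn8.0.5_0"))  # from the listing above
print(describe_pytorch_build("py3.6_cpu_0"))                      # assumed CPU-style build string
```

If the first form shows up in the download plan, the GPU build is being installed. If only a cpu-style string appears, cancel and adjust the cudatoolkit version.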

After installation, start a Python interpreter and run the following to check that the installation succeeded:

import torch
torch.__version__
torch.cuda.is_available()
torch.cuda.get_device_name(0)
torch.cuda.get_device_name(1)

I have PyTorch 1.6 + CUDA 10.1 set up in the system Python, and PyTorch 1.7.1 + CUDA 11.0 in the rtx3090 Anaconda environment. Running the commands above in each environment gives the following:

$ cuda10
$ python
Python 3.6.11 (default, Jun 29 2020, 05:15:03) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.6.0+cu101'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3090'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1080 Ti'
>>> 
$ cuda11
switch to cuda 11!
$ inconda rtx3090
Switch to rtx3090
(rtx3090) -----$ python
Python 3.6.12 | packaged by conda-forge | (default, Dec  9 2020, 00:36:02) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.7.1'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3090'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1080 Ti'
>>> 
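The session above queries device indices 0 and 1 directly, which fails on a machine with only one GPU. A minimal sketch that enumerates whatever is present is shown here. The helper name list_cuda_devices is hypothetical, and it takes the torch.cuda module as an argument so the logic can be exercised without a GPU.

```python
def list_cuda_devices(cuda):
    """Return the name of every visible GPU; `cuda` is expected to be the torch.cuda module."""
    if not cuda.is_available():
        return []
    return [cuda.get_device_name(i) for i in range(cuda.device_count())]

# Intended use inside the environment being checked:
#   import torch
#   print(torch.__version__, torch.version.cuda, list_cuda_devices(torch.cuda))
```

It relies only on torch.cuda.is_available, torch.cuda.device_count, and torch.cuda.get_device_name.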

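Building PyTorch 1.8 from source

Following the official instructions, first fetch the source code. The commands below clone PyTorch; torchvision is cloned in its own section.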

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive

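Create a new conda environment named pytorch18:

conda create -n pytorch18 python=3.7.9  # create the new environment pytorch18
conda activate pytorch18                # switch to it before installing the dependencies

Install the common dependencies: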

conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda111  # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo

Build and install with:

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install

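A long wait follows.

If the build stops with "Parse error. Expected a command name, got unquoted argument with text", the CMakeLists.txt files in your PyTorch tree are probably in the wrong encoding, which can happen when the source is copied from Windows to Ubuntu. Converting them to UTF-8 fixes it.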
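One way to do that conversion in bulk is sketched below. It assumes the problem is a byte-order mark (UTF-8 with BOM, or UTF-16) added by a Windows editor, which is only one possible cause. The function name rewrite_as_utf8 is made up for illustration.

```python
import codecs
from pathlib import Path

def rewrite_as_utf8(path):
    """Rewrite a text file as BOM-less UTF-8; return True if the file was changed."""
    raw = Path(path).read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")      # the BOM selects the byte order and is dropped
    else:
        return False                     # no BOM found; leave the file untouched
    Path(path).write_bytes(text.encode("utf-8"))
    return True

# Possible use from the directory that contains the pytorch checkout:
#   for cmake_file in Path("pytorch").rglob("CMakeLists.txt"):
#       if rewrite_as_utf8(cmake_file):
#           print("converted", cmake_file)
```

Files without a BOM are left alone, so running it twice is harmless. If the error persists afterwards, the encoding problem is of a different kind and the file needs to be inspected by hand.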

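If CMake cannot determine the cuDNN version (Found cuDNN: v?), as shown below, the cuDNN installation most likely went wrong. Check it against the official installation steps.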

-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found cuDNN: v?  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Error at 
..... /public/cuda.cmake:... (message):
  PyTorch requires cuDNN 7 and above.
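A quick way to see what the build can find is to read the version macros from the cuDNN header yourself. In cuDNN 8 these macros live in cudnn_version.h rather than cudnn.h, so a missing cudnn_version.h in the include directory is one plausible cause of "v?". The sketch below uses a made-up function name, read_cudnn_version, and a hand-written header excerpt.

```python
import re

def read_cudnn_version(header_text):
    """Return (major, minor, patch) from cuDNN header text, or None if a macro is missing."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#\s*define\s+" + macro + r"\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)

# Hand-written excerpt in the style of cudnn_version.h.
sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""
print(read_cudnn_version(sample))

# On the machine itself, using the include directory reported in the log above:
#   print(read_cudnn_version(open("/usr/include/cudnn_version.h").read()))
```

A None result, or a missing file, points at an incomplete copy of the cuDNN headers.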

Building torchvision from source

git clone --recursive https://github.com/pytorch/vision.git
cd vision
python setup.py install

RTX 3090 performance issues

Deep learning

  • RTX 3090 Benchmarks for Deep Learning – NVIDIA RTX 3090 vs 2080 Ti vs TITAN RTX vs RTX 6000/8000
  • Titan RTX vs RTX 3090 Transformer Benchmarks, Pytorch
  • Convolution operations are extremely slow on RTX 30 series GPU

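Results on PyTorch

Different convolution types

This test measures 1D convolution, 2D convolution, and 2D convolution with one-dimensional (k×1) kernels under each combination of the cuDNN benchmark and deterministic flags. It runs the forward pass only, with no backward pass. The test code: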

import torch
import torch.nn as nn
import time



device = 'cuda:0'    # RTX 3090
#device = 'cuda:1'   # GTX 1080 Ti; uncomment to test the other card

niters = 1000

print("Torch version: ", torch.__version__)
print("Torch CUDA version: ", torch.version.cuda)
print("CUDNN Version: ", torch.backends.cudnn.version())
print(torch.cuda.get_device_name(int(device[-1])))

def profile(model, x, benchmark, deterministic, nb_iters):
    torch.backends.cudnn.benchmark = benchmark
    torch.backends.cudnn.deterministic = deterministic

    # warmup
    for _ in range(10):
        out = model(x)

    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(nb_iters):
        out = model(x)
    torch.cuda.synchronize()
    t1 = time.time()

    return (t1 - t0) / nb_iters


model1 = nn.Sequential(
    nn.Conv1d(24, 256, kernel_size=(12,), stride=(6,), groups=4),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=(6,), stride=(3,), padding=(2,), groups=4),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,), groups=4),
    nn.ReLU(),
)

model1.to(device=device)

x = torch.randn(64, 24, 224, device=device)

time0 = profile(model1, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv1d model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model1, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv1d model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model1, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv1d model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model1, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv1d model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))

model2 = nn.Sequential(
    nn.Conv2d(8, 32, kernel_size=(8, 8), stride=(4, 4)),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
    nn.ReLU()
)
model2.to(device=device)

x = torch.randn(64, 8, 224, 224, device=device)
time0 = profile(model2, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv2d model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model2, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv2d model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model2, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv2d model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model2, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv2d model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))

model3 = nn.Sequential(
    nn.Conv2d(8, 32, kernel_size=(8, 1), stride=(4, 1)),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=(4, 1), stride=(2, 1), padding=(1, 1)),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(3, 1), stride=(1, 1), padding=(1, 1)),
    nn.ReLU()
)
model3.to(device=device)

x = torch.randn(64, 8, 224, 224, device=device)
time0 = profile(model3, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv2d1 model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model3, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv2d1 model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model3, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv2d1 model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model3, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv2d1 model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))

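The results are below. For 2D convolution the 3090 is nearly twice as fast as the 1080 Ti, while for 1D convolution the gain is small. The PyTorch version also has some effect on performance: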

Torch version:  1.8.0.dev20210106+cu110
Torch CUDA version:  11.0
CUDNN Version:  8005
GeForce RTX 3090
Conv1d model, benchmark=False, deterministic=False, 0.687ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.511ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.540ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.484ms/iter
Conv2d model, benchmark=False, deterministic=False, 1.327ms/iter
Conv2d model, benchmark=True, deterministic=False, 1.335ms/iter
Conv2d model, benchmark=False, deterministic=True, 1.474ms/iter
Conv2d model, benchmark=True, deterministic=True, 1.480ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 3.278ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 3.280ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 3.286ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 3.286ms/iter

Torch version:  1.8.0.dev20210106+cu110
Torch CUDA version:  11.0
CUDNN Version:  8005
GeForce GTX 1080 Ti
Conv1d model, benchmark=False, deterministic=False, 0.709ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.711ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.844ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.711ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.684ms/iter
Conv2d model, benchmark=True, deterministic=False, 2.883ms/iter
Conv2d model, benchmark=False, deterministic=True, 2.212ms/iter
Conv2d model, benchmark=True, deterministic=True, 2.195ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 5.583ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 6.077ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 6.097ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 6.120ms/iter

Torch version:  1.6.0+cu101
Torch CUDA version:  10.1
CUDNN Version:  7603
GeForce GTX 1080 Ti
Conv1d model, benchmark=False, deterministic=False, 0.544ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.542ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.544ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.542ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.149ms/iter
Conv2d model, benchmark=True, deterministic=False, 2.332ms/iter
Conv2d model, benchmark=False, deterministic=True, 2.469ms/iter
Conv2d model, benchmark=True, deterministic=True, 2.483ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 6.097ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 6.637ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 6.658ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 6.679ms/iter
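To put numbers on the comparison, the printed timings can be turned into speedup ratios. This is a minimal sketch using only the benchmark=False, deterministic=False rows of the two PyTorch 1.8 runs above. The helper name parse_timings is made up for illustration.

```python
import re

LINE = re.compile(r"(\w+) model, benchmark=(\w+), deterministic=(\w+), ([\d.]+)ms/iter")

def parse_timings(log):
    """Map (model, benchmark, deterministic) to milliseconds per iteration."""
    return {(m.group(1), m.group(2), m.group(3)): float(m.group(4))
            for m in LINE.finditer(log)}

rtx3090 = parse_timings("""
Conv1d model, benchmark=False, deterministic=False, 0.687ms/iter
Conv2d model, benchmark=False, deterministic=False, 1.327ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 3.278ms/iter
""")
gtx1080ti = parse_timings("""
Conv1d model, benchmark=False, deterministic=False, 0.709ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.684ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 5.583ms/iter
""")

for key in rtx3090:
    speedup = gtx1080ti[key] / rtx3090[key]   # above 1 means the 3090 is faster
    print("{}: {:.2f}x".format(key[0], speedup))
```

For these rows the ratios come out at roughly 1.03x for Conv1d, 2.02x for Conv2d, and 1.70x for the k×1 Conv2d model.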

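MNIST classification

Train a convolutional network, evaluate it on the test set after every epoch, and record the total time spent on training and testing. The test code: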

from __future__ import print_function
import argparse
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR

device = 'cuda:0'
#device = 'cuda:1'
#num_workers = 1
num_workers = 4
batch_size = 64
epochs = 10
#benchmark = True
benchmark = False
deterministic = True
#deterministic = False
cudaTF32 = True
#cudaTF32 = False
cudnnTF32 = True
#cudnnTF32 = False

print("Torch Version: ", torch.__version__)
print("Torch CUDA Version: ", torch.version.cuda)
print("CUDNN Version: ", torch.backends.cudnn.version())
print("GPU Device: ", torch.cuda.get_device_name(int(device[-1])))
print("CUDNN Benchmark: ", benchmark)
print("CUDNN Deterministic: ", deterministic)
print("CUDA TF32: ", cudaTF32)
print("CUDNN TF32: ", cudnnTF32)
print("Workers: ", num_workers)
print("Batch Size: ", batch_size)
print("Epochs: ", epochs)

torch.backends.cudnn.benchmark = benchmark
torch.backends.cudnn.deterministic = deterministic
#torch.backends.cuda.matmul.allow_tf32 = cudaTF32
#torch.backends.cudnn.allow_tf32 = cudnnTF32

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    train_loss = 0.
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    train_loss /= len(train_loader.dataset)
    return train_loss
    

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    return test_loss

def main():
    global device, num_workers, batch_size, epochs
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=3, metavar='N',
                        help='number of epochs to train (default: 3)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=2020, metavar='S',
                        help='random seed (default: 2020)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device(device if use_cuda else "cpu")
    args.batch_size = batch_size
    args.epochs = epochs

    kwargs = {'batch_size': args.batch_size}
    if use_cuda:
        kwargs.update({'num_workers': num_workers,
                       'pin_memory': True,
                       'shuffle': True},
                     )

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    dataset1 = datasets.MNIST('../data', train=True, download=True,
                       transform=transform)
    dataset2 = datasets.MNIST('../data', train=False,
                       transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1,**kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    tstart = time.time()
    train_loss, test_loss = 0., 0.
    for epoch in range(1, args.epochs + 1):
        train_loss += train(args, model, device, train_loader, optimizer, epoch)
        test_loss += test(model, device, test_loader)
        scheduler.step()
    tend = time.time()
    train_loss /= args.epochs
    test_loss /= args.epochs

    print("Training Loss: ", train_loss)
    print("Testing  Loss: ", test_loss)
    print("Time: %.4f" % (tend - tstart))
    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()

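The results are below. The 3090 performs about the same as the 1080 Ti here, and is in fact slightly slower, which is far from the performance advertised on the official site. On a more complex network of my own, the 3090 did even worse.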

Torch Version:  1.8.0.dev20210106+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDNN Benchmark:  False
CUDNN Deterministic:  True
CUDA TF32:  True
CUDNN TF32:  True
Workers:  4
Batch Size:  64
Epochs:  10
Training Loss:  0.0009187350056997093
Testing  Loss:  0.03026894662413977
Time: 63.0998



Torch Version:  1.8.0.dev20210106+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDNN Benchmark:  False
CUDNN Deterministic:  True
CUDA TF32:  True
CUDNN TF32:  True
Workers:  4
Batch Size:  64
Epochs:  10
Training Loss:  0.0008879667648342487
Testing  Loss:  0.030606915746741615
Time: 56.9057

Torch Version:  1.6.0+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDNN Benchmark:  False
CUDNN Deterministic:  True
Workers:  4
Batch Size:  64
Epochs:  10
Training Loss:  0.0009102054347947707
Testing  Loss:  0.029809928882313725
Time: 52.8144
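The wall-clock times above are easier to compare as throughput. The sketch below converts them to images per second, assuming the standard MNIST split of 60000 training and 10000 test images, with every epoch passing over both sets as in the script. The function name images_per_second is made up.

```python
TRAIN_IMAGES = 60000   # MNIST training set size
TEST_IMAGES = 10000    # MNIST test set size

def images_per_second(total_seconds, epochs):
    """Average images processed per second over training plus evaluation."""
    return epochs * (TRAIN_IMAGES + TEST_IMAGES) / total_seconds

# Times copied from the three runs above.
runs = {
    "RTX 3090, PyTorch 1.8 nightly": 63.0998,
    "GTX 1080 Ti, PyTorch 1.8 nightly": 56.9057,
    "GTX 1080 Ti, PyTorch 1.6": 52.8144,
}
for name, seconds in runs.items():
    print("{}: {:.0f} images/s".format(name, images_per_second(seconds, epochs=10)))
```

The timed interval in the script also covers data loading and host-to-device copies, so these figures measure the whole pipeline, not GPU compute alone.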

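The installed PyTorch is a binary build targeting CUDA 11.0, while the CUDA version installed in this post is 11.1. Concerned that this mismatch might hold the 3090 back, I reinstalled CUDA 11.0 and repeated the test. The results below show that the CUDA version makes no difference.

Figure: RTX 3090, PyTorch 1.8, MNIST, 3 runs
Figure: GTX 1080 Ti, PyTorch 1.8, MNIST, 3 runs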

Results on TensorFlow

CIFAR image classification

The TensorFlow version is 2.4.0, with CUDA 11.0 and cuDNN 8005 (8.0.5). The dataset is CIFAR-10. The test code:

Main file

import os
import time
import tensorflow as tf

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays, 
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.summary()

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

tstart = time.time()
history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))
tend = time.time()
print("Training time: ", tend - tstart)

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')

tstart = time.time()
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
tend = time.time()
print("Testing time: ", tend - tstart)

print(test_acc)

The results are below. Under TensorFlow the 3090's performance is also unremarkable, about the same as the 1080 Ti:

2021-01-09 22:49:35.037113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22113 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)

2021-01-09 22:49:36.122102: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-09 22:49:36.122543: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2298505000 Hz
Epoch 1/10
2021-01-09 22:49:36.553517: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 22:49:37.349985: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 22:49:37.353723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-09 22:49:39.345531: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

1563/1563 [==============================] - 13s 6ms/step - loss: 1.7480 - accuracy: 0.3552 - val_loss: 1.2966 - val_accuracy: 0.5368
Epoch 2/10
1563/1563 [==============================] - 16s 11ms/step - loss: 1.1864 - accuracy: 0.5776 - val_loss: 1.1261 - val_accuracy: 0.6031
Epoch 3/10
1563/1563 [==============================] - 9s 6ms/step - loss: 1.0160 - accuracy: 0.6462 - val_loss: 0.9643 - val_accuracy: 0.6648
Epoch 4/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8975 - accuracy: 0.6860 - val_loss: 0.9399 - val_accuracy: 0.6661
Epoch 5/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8137 - accuracy: 0.7145 - val_loss: 0.9458 - val_accuracy: 0.6683
Epoch 6/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.7547 - accuracy: 0.7366 - val_loss: 0.8510 - val_accuracy: 0.7013
Epoch 7/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6963 - accuracy: 0.7557 - val_loss: 0.8670 - val_accuracy: 0.7034
Epoch 8/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6321 - accuracy: 0.7779 - val_loss: 0.8671 - val_accuracy: 0.7068
Epoch 9/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6121 - accuracy: 0.7854 - val_loss: 0.8556 - val_accuracy: 0.7122
Epoch 10/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.5720 - accuracy: 0.7980 - val_loss: 0.8800 - val_accuracy: 0.7110
Training time:  100.08890771865845
313/313 - 1s - loss: 0.8800 - accuracy: 0.7110
Testing time:  0.9198315143585205
0.7110000252723694


-----------------------------------------------------------------------


2021-01-09 22:53:03.004101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10269 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)

2021-01-09 22:53:03.788911: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-09 22:53:03.789360: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2298505000 Hz
Epoch 1/10
2021-01-09 22:53:04.196189: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 22:53:04.408018: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 22:53:04.410362: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
1563/1563 [==============================] - 12s 6ms/step - loss: 1.7363 - accuracy: 0.3623 - val_loss: 1.3144 - val_accuracy: 0.5389
Epoch 2/10
1563/1563 [==============================] - 10s 6ms/step - loss: 1.1831 - accuracy: 0.5816 - val_loss: 1.0454 - val_accuracy: 0.6328
Epoch 3/10
1563/1563 [==============================] - 9s 6ms/step - loss: 1.0309 - accuracy: 0.6423 - val_loss: 0.9749 - val_accuracy: 0.6596
Epoch 4/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.9134 - accuracy: 0.6766 - val_loss: 0.9642 - val_accuracy: 0.6651
Epoch 5/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8405 - accuracy: 0.7080 - val_loss: 0.9484 - val_accuracy: 0.6706
Epoch 6/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.7702 - accuracy: 0.7287 - val_loss: 0.8654 - val_accuracy: 0.7013
Epoch 7/10
1563/1563 [==============================] - 10s 6ms/step - loss: 0.7279 - accuracy: 0.7445 - val_loss: 0.8597 - val_accuracy: 0.7013
Epoch 8/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6909 - accuracy: 0.7604 - val_loss: 0.9126 - val_accuracy: 0.6914
Epoch 9/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6469 - accuracy: 0.7717 - val_loss: 0.9200 - val_accuracy: 0.6951
Epoch 10/10
1563/1563 [==============================] - 10s 6ms/step - loss: 0.6022 - accuracy: 0.7892 - val_loss: 0.8853 - val_accuracy: 0.7042
Training time:  97.05110216140747
313/313 - 1s - loss: 0.8853 - accuracy: 0.7042
Testing time:  1.0514824390411377
0.704200029373169
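
The step counts and epoch times in the two logs above can be cross-checked with a little arithmetic. The sketch below assumes the CIFAR-10 training set of 50,000 images and the Keras default batch size of 32 (the `fit` call above does not set `batch_size`); the 6 ms/step figure and the total training times are copied from the logs.

```python
import math

# Assumptions: 50,000 CIFAR-10 training images and the Keras default batch size of 32.
train_images = 50_000
batch_size = 32
steps_per_epoch = math.ceil(train_images / batch_size)

# Values copied from the logs above.
step_time_s = 0.006  # "6ms/step", reported on both GPUs
total_train_s = {"RTX 3090": 100.09, "GTX 1080 Ti": 97.05}

epoch_time_s = steps_per_epoch * step_time_s
ratio = total_train_s["RTX 3090"] / total_train_s["GTX 1080 Ti"]

print(f"steps per epoch: {steps_per_epoch}")
print(f"estimated epoch time: {epoch_time_s:.1f} s")
print(f"3090 / 1080 Ti total training time: {ratio:.2f}")
```

Both cards spend about 6 ms on each step of 32 images, which suggests (but does not prove) that this small network is limited by per-step overhead rather than by GPU arithmetic throughput.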

Things to note

Tensor Float32

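According to the "TF32 on Ampere" section of the PyTorch documentation, 30-series GPUs support TensorFloat-32 (TF32) computation. When it is enabled, the dedicated Tensor Core units are used, which is faster than ordinary float32 arithmetic but less precise. In the PyTorch versions used in this post (1.7 and 1.8) both flags are enabled by default; since PyTorch 1.12 the matmul flag defaults to False.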

# The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True
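
To get a feel for the precision cost, the following sketch emulates TF32 storage in plain Python. TF32 keeps the 8 exponent bits of float32 but only 10 of its 23 mantissa bits. The helper names `to_tf32` and `to_fp32` are hypothetical, and truncating the low 13 bits is a simplifying assumption; the rounding performed by the actual hardware may differ.

```python
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 by truncating a float32 to 10 mantissa bits (assumption: truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign (1 bit), exponent (8 bits), top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


for value in (1.0, 1.0 + 2**-10, 1.0 + 2**-11, 3.14159265):
    fp32 = to_fp32(value)
    tf32 = to_tf32(value)
    rel_err = abs(fp32 - tf32) / abs(fp32)
    print(f"{value!r}: fp32={fp32!r} tf32={tf32!r} relative error={rel_err:.2e}")
```

With 10 mantissa bits the relative spacing between representable values is 2^-10 (about 1e-3), compared with 2^-23 (about 1.2e-7) for float32.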

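A strange phenomenon

The base environment is:

  • OS: Ubuntu16.04
  • CUDA: 8.0, 9.0, 10.1 and 11.1 installed side by side

First PyTorch configuration:

  • System Python environment: PyTorch 1.6.0+cu101
  • Anaconda environment: PyTorch 1.8.0.dev20210106+cu110

Second PyTorch configuration:

  • System Python environment: PyTorch 1.9.0.dev20210208+cu110 + Python3.6.11
  • Anaconda environment: PyTorch 1.8.0.dev20210208+cu110 + Python3.7.3

Third PyTorch configuration:

  • System Python environment: PyTorch 1.8.0.dev20210208+cu110 + Python3.6.11
  • Anaconda environment: PyTorch 1.8.0.dev20210208+cu110 + Python3.7.3

In the end I found that whenever the PyTorch version in the system Python differs from the one in Anaconda, any program run on the GPU hangs. The hang disappears only when the two versions are identical, as in the third configuration.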
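
A quick way to check for this kind of mismatch is to ask each interpreter which PyTorch version it imports. This is only a sketch: the function names are hypothetical, and the interpreter list must be filled in with the real locations of the system and Anaconda interpreters on your machine.

```python
import subprocess
import sys


def module_version(interpreter, module="torch"):
    """Return module.__version__ as seen by `interpreter`, or None if it cannot be imported."""
    code = f"import {module}; print({module}.__version__)"
    try:
        result = subprocess.run(
            [interpreter, "-c", code],
            capture_output=True, text=True, timeout=120,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def versions_consistent(versions):
    """True if every interpreter reports the same, non-missing version."""
    found = set(versions.values())
    return len(found) == 1 and None not in found


if __name__ == "__main__":
    # Only the current interpreter is listed here; append the paths of the
    # other interpreters (system Python, Anaconda Python) to compare them.
    interpreters = [sys.executable]
    versions = {path: module_version(path) for path in interpreters}
    for path, version in versions.items():
        print(f"{path}: torch {version}")
    print("consistent:", versions_consistent(versions))
```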

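Performance tests of different PyTorch versions on different devices

  1. torch.manual_seed(seed) sets the random seed, so that every run starts from the same initial network weights.
  2. torch.backends.cudnn.benchmark: whether cuDNN may search for the fastest convolution implementation itself. Different devices and different convolutions lead to differences in the speed and precision of the chosen implementation. If the network structure is static and the input size does not change, set it to True; otherwise set it to False, because the repeated search for a fast implementation would itself take a lot of time.
  3. torch.backends.cudnn.deterministic: whether cuDNN is forbidden from using non-deterministic algorithms. If True, deterministic algorithms are used. This setting affects both speed and numerical results.
  4. torch.backends.cuda.matmul.allow_tf32: whether TensorFloat32 (TF32) Tensor Cores may be used for matrix multiplication. True improves speed at the cost of precision. Only Ampere-architecture GPUs support it; on other GPUs the setting has no effect.
  5. torch.backends.cudnn.allow_tf32: whether cuDNN may use TensorFloat32 (TF32) Tensor Cores. True improves speed at the cost of precision. Only Ampere-architecture GPUs support it; on other GPUs the setting has no effect.
  • See "pytorch, nvidia ampere tensor cores :speed vs precision"; the precision loss is fairly large.

For ease of comparison, the timing results are summarized here: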
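
The settings listed above could be applied and reported in one place, for example as follows. This is a minimal sketch: `configure_backend` is a hypothetical helper, the seed value is arbitrary, and the printed labels merely imitate the log headers shown later in this post; it is not the author's actual test script. The import is guarded so the file still runs where PyTorch is not installed.

```python
def configure_backend(torch, seed=2021, benchmark=False, deterministic=False, allow_tf32=False):
    """Apply the five settings discussed above to the given torch module."""
    torch.manual_seed(seed)                             # same initial weights on every run
    torch.backends.cudnn.benchmark = benchmark          # let cuDNN search for fast kernels
    torch.backends.cudnn.deterministic = deterministic  # restrict cuDNN to deterministic algorithms
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32  # TF32 for matrix multiplication
    torch.backends.cudnn.allow_tf32 = allow_tf32        # TF32 inside cuDNN convolutions


try:
    import torch
except ImportError:  # PyTorch not installed: nothing to configure
    torch = None

if torch is not None:
    configure_backend(torch, benchmark=True, deterministic=False, allow_tf32=False)
    print("Torch Version: ", torch.__version__)
    print("Torch CUDA Version: ", torch.version.cuda)
    print("CUDNN Version: ", torch.backends.cudnn.version())
    if torch.cuda.is_available():
        print("GPU Device: ", torch.cuda.get_device_name(0))
    print("CUDA TF32: ", torch.backends.cuda.matmul.allow_tf32)
    print("CUDNN TF32: ", torch.backends.cudnn.allow_tf32)
    print("CUDNN Benchmark: ", torch.backends.cudnn.benchmark)
    print("CUDNN Deterministic: ", torch.backends.cudnn.deterministic)
```

Passing the module in as an argument keeps the helper easy to exercise with a stand-in object on machines without a GPU.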


PyTorch 1.8.0.dev20210208+cu110 + RTX 3090+CUDA11

1. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
2. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True : train: 55s, valid: 14s
3. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
4. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
5. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
6. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True : train: 54s, valid: 14s
7. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
8. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True : train: 11s, valid: 8s

PyTorch 1.8.0.dev20210208+cu110 + GTX 1080TI+CUDA11

1. Benchmark=False, Deterministic=False: train: 21s, valid: 20s
2. Benchmark=False, Deterministic=True: train: 61s, valid: 20s
3. Benchmark=True, Deterministic=False: train: 18s, valid: 18s
4. Benchmark=True, Deterministic=True: train: 25s, valid: 18s


PyTorch 1.8.0.dev20210210+cu101 + GTX 1080TI+CUDA10

1. Benchmark=False, Deterministic=False: train: 29s, valid: 20s
2. Benchmark=False, Deterministic=True: train: 24s, valid: 20s
3. Benchmark=True, Deterministic=False: train: 17s, valid: 19s
4. Benchmark=True, Deterministic=True: train: 21s, valid: 19s

3090+CUDA11 results

The test configurations are:

  1. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False, run twice
  2. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True, run twice
  3. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False, run twice
  4. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True, run twice
  5. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False, run twice
  6. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True, run twice
  7. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False, run twice
  8. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True, run twice

The results lead to the following conclusions:

  • Whatever the Benchmark and Deterministic settings, TF32 gives essentially no speedup, while its precision loss slightly degrades the metrics.
  • Benchmark gives a large speedup (training drops from about 54 s to about 10 s per epoch), whatever the values of CUDATF32, CUDNNTF32 and Deterministic.
  • Deterministic has almost no effect on speed (enabling it costs at most about 1 s per training epoch), whatever the values of CUDATF32, CUDNNTF32 and Benchmark.

The timing results are:

  1. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
  2. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True : train: 55s, valid: 14s
  3. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
  4. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
  5. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
  6. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True : train: 54s, valid: 14s
  7. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
  8. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
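
The conclusions for the 3090 can be put into numbers by dividing the rounded epoch times. The following sketch uses only the figures from the list above; the dictionary layout and variable names are illustrative.

```python
# (TF32, Benchmark, Deterministic) -> (train seconds, valid seconds), from the list above
times = {
    (False, False, False): (54, 14),
    (False, False, True): (55, 14),
    (False, True, False): (10, 8),
    (False, True, True): (11, 8),
    (True, False, False): (54, 14),
    (True, False, True): (54, 14),
    (True, True, False): (10, 8),
    (True, True, True): (11, 8),
}

# Effect of Benchmark with the other two settings held fixed.
for tf32 in (False, True):
    for deterministic in (False, True):
        slow_train, slow_valid = times[(tf32, False, deterministic)]
        fast_train, fast_valid = times[(tf32, True, deterministic)]
        print(
            f"TF32={tf32}, Deterministic={deterministic}: "
            f"Benchmark speedup train {slow_train / fast_train:.1f}x, "
            f"valid {slow_valid / fast_valid:.2f}x"
        )

# Effect of TF32 on training time with Benchmark and Deterministic held fixed.
tf32_speedups = [
    times[(False, b, d)][0] / times[(True, b, d)][0]
    for b in (False, True)
    for d in (False, True)
]
print("TF32 train speedups:", [round(s, 2) for s in tf32_speedups])
```

Benchmark gives roughly a 5x training speedup in every combination, whereas the TF32 ratio stays at about 1.0.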

Detailed results:

  1. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7454, time: 51.87094736099243s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6298, time: 14.616422176361084s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7351, time: 53.827654123306274s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6338, time: 14.419052362442017s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0174, contrast: -3.7504, time: 54.31160569190979s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6331, time: 14.683520078659058s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7414, time: 54.10149121284485s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6291, time: 14.825896978378296s
  2. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7438, time: 55.177640199661255s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6328, time: 14.768057823181152s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7462, time: 55.64263844490051s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6270, time: 14.934310674667358s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7438, time: 54.337199211120605s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6328, time: 14.743611574172974s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7462, time: 55.72549605369568s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6270, time: 14.936981439590454s
  3. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False, run twice
GeForce RTX 3090
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0174, contrast: -3.7344, time: 9.696520328521729s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6324, time: 8.027890682220459s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7427, time: 9.307999849319458s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6288, time: 7.912508487701416s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0175, contrast: -3.7390, time: 10.154499053955078s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6319, time: 7.96985650062561s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0185, contrast: -3.7424, time: 10.21190619468689s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6280, time: 8.24509072303772s
  4. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7376, time: 10.749010562896729s
--->Valid epoch: 1, loss: 10.0892, entropy: 10.0892, l1norm: 10.0169, contrast: -3.6317, time: 8.022538900375366s
--->Train epoch: 2, loss: 10.0845, entropy: 10.0845, l1norm: 10.0184, contrast: -3.7403, time: 11.13167119026184s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6277, time: 8.124176025390625s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7376, time: 11.482399225234985s
--->Valid epoch: 1, loss: 10.0892, entropy: 10.0892, l1norm: 10.0169, contrast: -3.6317, time: 8.082707166671753s
--->Train epoch: 2, loss: 10.0845, entropy: 10.0845, l1norm: 10.0184, contrast: -3.7403, time: 11.133780717849731s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6277, time: 7.9467527866363525s
  5. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0174, contrast: -3.7520, time: 53.30528664588928s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6306, time: 14.644577264785767s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7400, time: 53.856117486953735s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6295, time: 14.902146816253662s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7454, time: 52.86871576309204s
--->Valid epoch: 1, loss: 10.0921, entropy: 10.0921, l1norm: 10.0176, contrast: -3.6283, time: 14.82244610786438s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7385, time: 54.21399688720703s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6301, time: 14.981997728347778s
  6. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7479, time: 53.832154750823975s
--->Valid epoch: 1, loss: 10.0914, entropy: 10.0914, l1norm: 10.0175, contrast: -3.6294, time: 14.578207015991211s
--->Train epoch: 2, loss: 10.0850, entropy: 10.0850, l1norm: 10.0185, contrast: -3.7328, time: 55.1286780834198s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6291, time: 14.752575635910034s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7479, time: 54.217995166778564s
--->Valid epoch: 1, loss: 10.0914, entropy: 10.0914, l1norm: 10.0175, contrast: -3.6294, time: 14.726275444030762s
--->Train epoch: 2, loss: 10.0850, entropy: 10.0850, l1norm: 10.0185, contrast: -3.7328, time: 58.631773948669434s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6291, time: 15.65324854850769s
  7. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7399, time: 9.78132176399231s
--->Valid epoch: 1, loss: 10.0895, entropy: 10.0895, l1norm: 10.0170, contrast: -3.6324, time: 7.887274980545044s
--->Train epoch: 2, loss: 10.0848, entropy: 10.0848, l1norm: 10.0185, contrast: -3.7401, time: 9.359825134277344s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6280, time: 7.946653127670288s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0838, entropy: 10.0838, l1norm: 10.0174, contrast: -3.7392, time: 9.73360824584961s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6307, time: 7.910605192184448s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7455, time: 9.480495691299438s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6302, time: 8.091880798339844s
  8. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7435, time: 10.862586498260498s
--->Valid epoch: 1, loss: 10.0924, entropy: 10.0924, l1norm: 10.0177, contrast: -3.6264, time: 7.919918537139893s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7405, time: 10.470875024795532s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6289, time: 7.927625894546509s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7435, time: 10.894693613052368s
--->Valid epoch: 1, loss: 10.0924, entropy: 10.0924, l1norm: 10.0177, contrast: -3.6264, time: 7.909468173980713s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7405, time: 10.432015419006348s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6289, time: 7.901139974594116s
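
Rounded figures such as "train: 54s" can be recomputed from raw logs like those above. The sketch below parses the `--->Train epoch` and `--->Valid epoch` lines with a regular expression and averages the times per phase; the function name and the embedded sample (four lines copied from the first run of configuration 1) are only for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches the phase name, the epoch number and the trailing "time: <seconds>s" field.
LINE_RE = re.compile(r"--->(Train|Valid) epoch: (\d+),.*time: ([0-9.]+)s")


def mean_epoch_times(log_text):
    """Return the mean epoch time in seconds for each phase found in log_text."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            phase, _epoch, seconds = match.groups()
            times[phase].append(float(seconds))
    return {phase: mean(values) for phase, values in times.items()}


sample = """\
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7454, time: 51.87094736099243s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6298, time: 14.616422176361084s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7351, time: 53.827654123306274s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6338, time: 14.419052362442017s
"""

result = mean_epoch_times(sample)
for phase, seconds in result.items():
    print(f"{phase}: {seconds:.1f} s per epoch")
```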

1080TI+CUDA11

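First, check whether CUDA TF32 and CUDNN TF32 have any effect on the 1080 Ti. The results below show that they do not: with the same Benchmark and Deterministic settings, the losses are identical and the times essentially the same whether TF32 is enabled or not.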

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.39651656150818s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.37838625907898s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 60.98785209655762s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.49517011642456s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.536439895629883s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.666481971740723s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.99992823600769s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.800203800201416s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.48216986656189s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.80778193473816s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.893065929412842s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.811289072036743s
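
The absence of any TF32 effect on the 1080 Ti is consistent with the hardware requirement: TF32 Tensor Cores were introduced with the Ampere architecture, i.e. CUDA compute capability 8.0 or higher. The RTX 3090 reports 8.6 and the GTX 1080 Ti 6.1 in the TensorFlow logs earlier in this post. A small check along these lines could look as follows (the helper `supports_tf32` is hypothetical; `torch.cuda.get_device_capability` is the PyTorch call it is meant to be fed from):

```python
def supports_tf32(capability):
    """capability is a (major, minor) tuple as returned by torch.cuda.get_device_capability()."""
    major, _minor = capability
    return major >= 8  # Ampere and newer


# Compute capabilities taken from the device logs earlier in this post.
for name, capability in (("GeForce RTX 3090", (8, 6)), ("GeForce GTX 1080 Ti", (6, 1))):
    print(f"{name}: TF32 supported = {supports_tf32(capability)}")

try:
    import torch
except ImportError:  # PyTorch not installed: skip the live query
    torch = None

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), "TF32 supported =", supports_tf32(capability))
```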

Next, test the influence of Benchmark and Deterministic, with the following configurations:

  1. Benchmark=False, Deterministic=False, run twice
  2. Benchmark=False, Deterministic=True, run twice
  3. Benchmark=True, Deterministic=False, run twice
  4. Benchmark=True, Deterministic=True, run twice

The timing results are:

  1. Benchmark=False, Deterministic=False: train: 21s, valid: 20s
  2. Benchmark=False, Deterministic=True: train: 61s, valid: 20s
  3. Benchmark=True, Deterministic=False: train: 18s, valid: 18s
  4. Benchmark=True, Deterministic=True: train: 25s, valid: 18s

Detailed results:

  1. Benchmark=False, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7399, time: 20.35549545288086s
--->Valid epoch: 1, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6324, time: 18.916125535964966s
--->Train epoch: 2, loss: 10.0855, entropy: 10.0855, l1norm: 10.0186, contrast: -3.7225, time: 22.069878578186035s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6277, time: 20.496481895446777s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7452, time: 20.829981803894043s
--->Valid epoch: 1, loss: 10.0902, entropy: 10.0902, l1norm: 10.0172, contrast: -3.6316, time: 20.57017183303833s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0183, contrast: -3.7417, time: 22.458086252212524s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6297, time: 20.48271942138672s
  2. Benchmark=False, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.37249255180359s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.48044991493225s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7361, time: 61.695250272750854s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6316, time: 20.484682321548462s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.24735903739929s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.471797704696655s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7361, time: 61.43708038330078s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6316, time: 20.503613471984863s
  3. Benchmark=True, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0841, entropy: 10.0841, l1norm: 10.0176, contrast: -3.7429, time: 18.17175269126892s
--->Valid epoch: 1, loss: 10.0923, entropy: 10.0923, l1norm: 10.0177, contrast: -3.6254, time: 18.15757131576538s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7369, time: 18.792244911193848s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6295, time: 18.823336124420166s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7487, time: 18.5258526802063s
--->Valid epoch: 1, loss: 10.0930, entropy: 10.0930, l1norm: 10.0178, contrast: -3.6256, time: 18.816319704055786s
--->Train epoch: 2, loss: 10.0852, entropy: 10.0852, l1norm: 10.0185, contrast: -3.7307, time: 18.378268718719482s
--->Valid epoch: 2, loss: 10.0885, entropy: 10.0885, l1norm: 10.0168, contrast: -3.6320, time: 18.800978660583496s
  4. Benchmark=True, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.72299075126648s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.807397603988647s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.79756498336792s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.80388379096985s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.729687929153442s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.8100643157959s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.809940576553345s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.82235860824585s

1080TI+CUDA10

  1. Benchmark=False, Deterministic=False, run twice
  2. Benchmark=False, Deterministic=True, run twice
  3. Benchmark=True, Deterministic=False, run twice
  4. Benchmark=True, Deterministic=True, run twice

The timing results are:

  1. Benchmark=False, Deterministic=False: train: 29s, valid: 20s
  2. Benchmark=False, Deterministic=True: train: 24s, valid: 20s
  3. Benchmark=True, Deterministic=False: train: 17s, valid: 19s
  4. Benchmark=True, Deterministic=True: train: 21s, valid: 19s

Detailed results:

  1. Benchmark=False, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7381, time: 28.030856609344482s
--->Valid epoch: 1, loss: 10.0925, entropy: 10.0925, l1norm: 10.0177, contrast: -3.6266, time: 19.793615102767944s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7522, time: 29.9300274848938s
--->Valid epoch: 2, loss: 10.0899, entropy: 10.0899, l1norm: 10.0172, contrast: -3.6343, time: 20.407748699188232s


GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7364, time: 28.959442138671875s
--->Valid epoch: 1, loss: 10.0906, entropy: 10.0906, l1norm: 10.0173, contrast: -3.6312, time: 20.42161989212036s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7435, time: 29.956018924713135s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6298, time: 20.411144495010376s
  2. Benchmark=False, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0173, contrast: -3.7488, time: 23.54142165184021s
--->Valid epoch: 1, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6269, time: 20.411819219589233s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0187, contrast: -3.7421, time: 24.92523694038391s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6279, time: 20.398447513580322s

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0173, contrast: -3.7488, time: 23.943784952163696s
--->Valid epoch: 1, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6269, time: 20.42457938194275s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0187, contrast: -3.7421, time: 25.094601154327393s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6279, time: 20.471126079559326s
  3. Benchmark=True, Deterministic=False, run twice
Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7485, time: 16.708702325820923s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6340, time: 19.010523319244385s
--->Train epoch: 2, loss: 10.0838, entropy: 10.0838, l1norm: 10.0183, contrast: -3.7436, time: 17.553938627243042s
--->Valid epoch: 2, loss: 10.0883, entropy: 10.0883, l1norm: 10.0167, contrast: -3.6318, time: 19.111060619354248s

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0840, entropy: 10.0840, l1norm: 10.0175, contrast: -3.7400, time: 16.517553091049194s
--->Valid epoch: 1, loss: 10.0916, entropy: 10.0916, l1norm: 10.0175, contrast: -3.6292, time: 18.997257232666016s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7379, time: 17.554461240768433s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6306, time: 19.144949674606323s
  1. Benchmark=True, Deterministic=True, run twice
Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0839, entropy: 10.0839, l1norm: 10.0175, contrast: -3.7339, time: 20.768550872802734s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6320, time: 19.122512578964233s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7527, time: 21.37337899208069s
--->Valid epoch: 2, loss: 10.0892, entropy: 10.0892, l1norm: 10.0170, contrast: -3.6347, time: 19.099023818969727s

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0175, contrast: -3.7441, time: 21.45243787765503s
--->Valid epoch: 1, loss: 10.0929, entropy: 10.0929, l1norm: 10.0178, contrast: -3.6257, time: 19.00617289543152s
--->Train epoch: 2, loss: 10.0854, entropy: 10.0854, l1norm: 10.0185, contrast: -3.7285, time: 21.43665385246277s
--->Valid epoch: 2, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6309, time: 19.1219220161438s
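
To compare configurations it helps to average the per-epoch times instead of reading them off line by line. The sketch below parses log lines in the format shown above and computes the mean training and validation time. The names `epoch_times` and `mean_times` are hypothetical, and the sample text is copied from the first Benchmark=True, Deterministic=False run.

```python
import re
from statistics import mean

# Matches lines such as "--->Train epoch: 1, loss: ..., time: 16.70s".
LINE_RE = re.compile(r"--->(Train|Valid) epoch: (\d+),.*\btime: ([0-9.]+)s")

SAMPLE_LOG = """\
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7485, time: 16.708702325820923s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6340, time: 19.010523319244385s
--->Train epoch: 2, loss: 10.0838, entropy: 10.0838, l1norm: 10.0183, contrast: -3.7436, time: 17.553938627243042s
--->Valid epoch: 2, loss: 10.0883, entropy: 10.0883, l1norm: 10.0167, contrast: -3.6318, time: 19.111060619354248s
"""


def epoch_times(log_text: str) -> dict:
    """Return the per-epoch times in seconds, grouped by phase (Train / Valid)."""
    times = {"Train": [], "Valid": []}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            phase, _epoch, seconds = match.groups()
            times[phase].append(float(seconds))
    return times


def mean_times(log_text: str) -> dict:
    """Average the per-epoch times for each phase that has at least one entry."""
    return {phase: mean(values) for phase, values in epoch_times(log_text).items() if values}


if __name__ == "__main__":
    for phase, seconds in mean_times(SAMPLE_LOG).items():
        print(f"{phase}: {seconds:.2f}s per epoch")
```

Running the same function over each block of log output gives one pair of numbers per configuration, which makes the effect of the Benchmark and Deterministic flags easier to compare.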