Ubuntu 16.04 + NVIDIA RTX 3090 + PyTorch + TensorFlow

2023-05-16

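Contents

Overview
Useful links
Graphics driver installation
- File downloads
- Installing the display driver and the CUDA toolkit together
- Installing only the display driver
- Installing only the CUDA toolkit
Installing PyTorch
- Installing PyTorch 1.7
- Building PyTorch 1.8 from source
- Building torchvision from source
RTX 3090 performance issues
- Deep learning
- Test results on PyTorch
  - Different convolution types
  - MNIST classification
- Test results on TensorFlow
  - CIFAR image classification
- Points to note
  - Tensor Float32
Odd observations
PyTorch performance across versions and devices
- 3090 + CUDA 11 results
- 1080 Ti + CUDA 11
- 1080 Ti + CUDA 10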

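Overview

This post records how the display driver and the CUDA compute toolkit were installed for an RTX 3090. All installers used here are in .run format.

Useful links

PyTorch binary package downloads

Graphics driver installation

File downloads

CUDA: installer file cuda_11.1.0_455.23.05_linux.run
Driver: installer files NVIDIA-Linux-x86_64-455.45.01.run, NVIDIA-Linux-x86_64-455.23.04.run

Installing the display driver and the CUDA toolkit together

Follow the steps in "Ubuntu 16.04 LTS + CUDA8.0 + cudnn6.0".

After installation, reboot, change to the directory /usr/local/cuda-11.1/samples/1_Utilities/deviceQuery, run sudo make, and then run ./deviceQuery to display the device and driver information.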

/usr/local/cuda-11.1/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce RTX 3090"
  CUDA Driver Version / Runtime Version          11.1 / 11.1
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24265 MBytes (25443893248 bytes)
  (82) Multiprocessors, (128) CUDA Cores/MP:     10496 CUDA Cores
  GPU Max Clock rate:                            1785 MHz (1.78 GHz)
  Memory Clock rate:                             9751 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.1 / 11.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce RTX 3090 (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce RTX 3090 (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 11.1, NumDevs = 2
Result = PASS
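
When this check has to be repeated on several machines, the deviceQuery text can be parsed instead of read by eye. The following is only a minimal sketch: the function name parse_device_query and the abbreviated SAMPLE string are illustrative assumptions, and the patterns are written against the output format shown above, which may differ in other CUDA releases.

```python
import re

# Abbreviated sample taken from the deviceQuery output shown above.
SAMPLE = '''\
Device 0: "GeForce RTX 3090"
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24265 MBytes (25443893248 bytes)
Device 1: "GeForce GTX 1080 Ti"
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
Result = PASS
'''


def parse_device_query(text):
    """Return (devices, passed) parsed from deviceQuery's text output."""
    devices = []
    for line in text.splitlines():
        # A line such as: Device 0: "GeForce RTX 3090" starts a new device.
        m = re.match(r'Device (\d+): "(.+)"', line.strip())
        if m:
            devices.append({"index": int(m.group(1)), "name": m.group(2)})
            continue
        m = re.search(r"CUDA Capability Major/Minor version number:\s+(\d+)\.(\d+)", line)
        if m and devices:
            devices[-1]["capability"] = (int(m.group(1)), int(m.group(2)))
            continue
        m = re.search(r"Total amount of global memory:\s+(\d+) MBytes", line)
        if m and devices:
            devices[-1]["memory_mb"] = int(m.group(1))
    passed = "Result = PASS" in text
    return devices, passed


devices, passed = parse_device_query(SAMPLE)
for d in devices:
    print(d)
print("PASS" if passed else "FAIL")
```

A compute capability of (8, 6) for device 0 confirms that the driver sees the card as an Ampere GPU.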

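Tip: if no driver is installed yet, or the installed driver does not support the 3090, there are two options. One is to install the driver before fitting the 3090. The other is to fit the 3090 first and then install the driver, but in that case an error dialog appears at boot:

[Screenshot: driver error dialog]

When this happens, press Ctrl+Alt+F2 to switch to a text console. Do not press Ctrl+Alt+F1: otherwise sudo service lightdm stop may not shut down the X server completely, and installing the display driver then fails with the following error:

[Screenshot: driver installation error]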

[INFO]: Initializing menu
[INFO]: Silent install option: skipping toolkit
[INFO]: Silent install option: skipping samples
[INFO]: Silent install option: skipping toolkit
[INFO]: Silent install option: skipping toolkit
[INFO]: Components to install: 
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd  2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting

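Installing only the display driver

The driver can be installed either from the downloaded CUDA toolkit installer (e.g. cuda_11.1.0_455.23.05_linux.run) or from a separately downloaded display driver file (e.g. NVIDIA-Linux-x86_64-455.45.01.run).

[Screenshot: installer error message]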

 ERROR: You appear to be running an X server; please exit X before installing.  For 
         further details, please see the section INSTALLING THE NVIDIA DRIVER in the 
         README available on the Linux driver download page at www.nvidia.com.

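This message means the X server has not been shut down completely. Running ps aux | grep X does indeed show the related processes. If sudo service lightdm stop, sudo /etc/init.d/lightdm stop, or sudo /etc/init.d/gdm stop (for a gdm desktop) cannot shut down the X server completely, reboot, and at the moment the desktop is about to load and the driver error dialog pops up, press Ctrl+Alt+F2 to switch to a text console (again, not Ctrl+Alt+F1). Then run the installer again.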
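
The ps aux | grep X check can also be scripted so that the grep process itself is not reported as a match. This is a rough sketch, assuming the usual 11-column ps aux layout; the helper name find_x_processes and the SAMPLE_PS text are made up for illustration and are not taken from the machine described here.

```python
# Hypothetical `ps aux` output; the PIDs and paths are illustrative only.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1234  0.5  0.3 400000 60000 tty7     Ssl+ 10:00   0:05 /usr/lib/xorg/Xorg -core :0 -seat seat0
user      2345  0.0  0.0  14224   980 pts/0    S+   10:01   0:00 grep --color=auto X
"""


def find_x_processes(ps_output):
    """Return the `ps aux` lines whose executable is named X or Xorg."""
    hits = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        fields = line.split(None, 10)            # the 11th field is the full command
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]       # first word of the command
        if executable.rsplit("/", 1)[-1] in ("X", "Xorg"):
            hits.append(line)
    return hits


hits = find_x_processes(SAMPLE_PS)
print("X server still running" if hits else "no X server found")
```

On a real system the input could come from subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout. An empty result suggests it is safe to start the driver installer.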

Installing only the CUDA toolkit

Follow the steps in "Ubuntu 16.04 LTS + CUDA8.0 + cudnn6.0", and select only the CUDA libraries during installation (leave the driver unselected).

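Installing PyTorch

Installing PyTorch 1.7

Note: at the time of writing (2021-01-06), the conda repository has no PyTorch build for CUDA 11.1. Passing cudatoolkit=11.1 to the command below therefore downloads the CPU build of PyTorch, so cudatoolkit=11.0 is used here instead.

conda create -n rtx3090   # create a new environment named rtx3090
conda activate rtx3090    # activate it so the packages below are installed into it
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

The command prints the following package plan: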

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ca-certificates-2020.12.5  |       ha878542_0         137 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    certifi-2020.12.5          |   py36h5fab9bb_0         143 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    cudatoolkit-11.0.3         |       h15472ef_6       952.1 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    dataclasses-0.7            |     pyhe4b4509_6          21 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    freetype-2.8.1             |       hfa320df_1         789 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ld_impl_linux-64-2.35.1    |       hea4e1c9_1         617 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libffi-3.3                 |       h58526e2_2          51 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libgcc-ng-9.3.0            |      h5dbcf3e_17         7.8 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libpng-1.6.37              |       h21135ba_2         306 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libstdcxx-ng-9.3.0         |      h2ae2ef3_17         4.0 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libtiff-4.0.9              |       he6b73bb_1         521 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    libuv-1.40.0               |       h7f98852_0         1.0 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    llvm-openmp-11.0.0         |       hfc4b9b4_1         2.8 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl-2020.4                 |     h726a3e6_304       215.6 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl-service-2.3.0          |   py36h8c4c3a4_2          54 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl_fft-1.2.0              |   py36h68bb277_1         164 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    mkl_random-1.2.0           |   py36h7c3b610_1         314 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ninja-1.10.2               |       h4bd325d_0         2.4 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    numpy-1.19.2               |   py36h54aff64_0          21 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    numpy-base-1.19.2          |   py36hfa32c7d_0         5.2 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    olefile-0.46               |     pyh9f0ad1d_1          32 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    openssl-1.1.1i             |       h7f98852_0         2.1 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    pillow-5.2.0               |           py36_0        1007 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    pip-20.3.3                 |     pyhd8ed1ab_0         1.1 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    python-3.6.12              |hffdb5ce_0_cpython        38.4 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    pytorch-1.7.1              |py3.6_cuda11.0.221_cudnn8.0.5_0       770.6 MB  pytorch
    setuptools-49.6.0          |   py36h9880bd3_2         947 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    sqlite-3.34.0              |       h74cdb3f_0         1.4 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    tk-8.6.10                  |       h21135ba_1         3.2 MB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    torchaudio-0.7.2           |             py36         9.8 MB  pytorch
    torchvision-0.8.2          |       py36_cu110        17.9 MB  pytorch
    typing_extensions-3.7.4.3  |             py_0          25 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    wheel-0.36.2               |     pyhd3deb0d_0          31 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    xz-5.2.5                   |       h516909a_1         343 KB  http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
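
Whether conda picked a CUDA or a CPU build can be read off the build string in the pytorch row of this table. The snippet below is a small sketch of that check; the function name describe_pytorch_build is an illustrative assumption, and the CPU build string in the second call is a hypothetical example of the naming pattern rather than output captured from this installation.

```python
import re


def describe_pytorch_build(build):
    """Classify a conda build string of the pytorch package."""
    cuda = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    cudnn = re.search(r"cudnn(\d+(?:\.\d+)*)", build)
    if cuda:
        return {
            "kind": "cuda",
            "cuda": cuda.group(1),
            "cudnn": cudnn.group(1) if cudnn else None,
        }
    if "cpu" in build:
        return {"kind": "cpu", "cuda": None, "cudnn": None}
    return {"kind": "unknown", "cuda": None, "cudnn": None}


# Build string copied from the pytorch row of the package plan above.
gpu_build = describe_pytorch_build("py3.6_cuda11.0.221_cudnn8.0.5_0")
# Hypothetical CPU build string, shown only for contrast.
cpu_build = describe_pytorch_build("py3.6_cpu_0")
print(gpu_build)
print(cpu_build)
```

If the plan shows a CPU build, it is worth cancelling the installation and adjusting the cudatoolkit version before downloading several hundred megabytes.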

After the installation finishes, start the Python interpreter and run the following commands to check that it succeeded:

import torch
torch.__version__
torch.cuda.is_available()
torch.cuda.get_device_name(0)
torch.cuda.get_device_name(1)

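On my machine, the system Python environment has PyTorch 1.6 + CUDA 10.1, and the Anaconda environment rtx3090 created above has PyTorch 1.7.1 + CUDA 11.0. Running the commands above in each environment gives the following results (cuda10, cuda11 and inconda appear to be my own shell helpers for switching the CUDA version and the conda environment):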

$ cuda10
$ python
Python 3.6.11 (default, Jun 29 2020, 05:15:03) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.6.0+cu101'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3090'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1080 Ti'
>>>

$ cuda11
switch to cuda 11!
$ inconda rtx3090
Switch to rtx3090
(rtx3090) -----$ python
Python 3.6.12 | packaged by conda-forge | (default, Dec  9 2020, 00:36:02) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.7.1'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3090'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1080 Ti'
>>>

Building PyTorch 1.8 from source

Following the official steps, first download the PyTorch source code:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive

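Create a new conda environment named pytorch18:

conda create -n pytorch18 python=3.7.9  # create a new environment named pytorch18
conda activate pytorch18                # activate it before installing dependencies

Install the common dependencies: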

conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda111  # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo

Run the following commands to build and install:

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install

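The build then takes a long time.

If the build reports "Parse error. Expected a command name, got unquoted argument with text", the CMakeLists.txt files in your PyTorch tree may have the wrong encoding. This can happen when the source is copied from Windows to Ubuntu. Converting the files to UTF-8 fixes it.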
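
The conversion can be done in any editor, or scripted. The sketch below assumes the damage is a byte order mark, UTF-16 encoding, or Windows line endings, which are typical results of a Windows round trip but are not confirmed as the cause in this case. The helper name to_utf8 is illustrative.

```python
import codecs


def to_utf8(raw):
    """Re-encode bytes as BOM-free UTF-8 with Unix line endings."""
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")   # the utf-16 codec consumes the BOM
    else:
        text = raw.decode("utf-8")    # raises UnicodeDecodeError for other encodings
    return text.replace("\r\n", "\n").encode("utf-8")


# In-memory examples; no file is touched here.
with_bom = codecs.BOM_UTF8 + b"project(x)\r\n"
print(to_utf8(with_bom))
```

To apply it to a file, read the file in binary mode, pass the bytes through to_utf8, and write the result back in binary mode. Keeping a backup copy first is advisable.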

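If CMake cannot determine the cuDNN version (Found cuDNN: v?), as shown below, the cuDNN installation has probably gone wrong. Check it against the official installation steps.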

-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found cuDNN: v?  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Error at 
..... /public/cuda.cmake:... (message):
  PyTorch requires cuDNN 7 and above.
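
One possible cause, not confirmed for this machine, is a missing version header: starting with cuDNN 8 the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros live in cudnn_version.h rather than cudnn.h, so copying only cudnn.h leaves the build unable to read the version. The sketch below shows how those macros can be read from header text. The function name cudnn_version_from_header is illustrative, and the header strings are shortened stand-ins, not the full files.

```python
import re


def cudnn_version_from_header(header_text):
    """Return (major, minor, patch) from cuDNN header text, or None if absent."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#\s*define\s+" + name + r"\s+(\d+)"
        m = re.search(pattern, header_text, re.MULTILINE)
        if m is None:
            return None
        parts.append(int(m.group(1)))
    return tuple(parts)


# Shortened stand-in for cudnn_version.h from a cuDNN 8.0.5 install.
VERSION_HEADER = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
# Shortened stand-in for a cuDNN 8 cudnn.h, which only includes the version header.
MAIN_HEADER = '#include "cudnn_version.h"\n'

print(cudnn_version_from_header(VERSION_HEADER))
print(cudnn_version_from_header(MAIN_HEADER))
```

If the include directory reported by CMake (here /usr/include) contains cudnn.h but no cudnn_version.h, copying the missing header from the cuDNN archive is a reasonable first thing to try.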

Building torchvision from source

git clone --recursive https://github.com/pytorch/vision.git
cd vision
python setup.py install

RTX 3090 performance issues

Deep learning

RTX 3090 Benchmarks for Deep Learning – NVIDIA RTX 3090 vs 2080 Ti vs TITAN RTX vs RTX 6000/8000
Titan RTX vs RTX 3090 Transformer Benchmarks, Pytorch
Convolution operations are extremely slow on RTX 30 series GPU

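Test results on PyTorch

Different convolution types

This test measures 1D convolution, 2D convolution, and 1D convolution expressed through 2D convolution layers, for each combination of the cuDNN benchmark and deterministic flags. Only the forward pass is timed; no backward pass is run. The test code is as follows: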

import torch
import torch.nn as nn
import time



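# Select the GPU under test; swap the comment to benchmark the other card.
#device = 'cuda:0'   # GeForce RTX 3090
device = 'cuda:1'    # GeForce GTX 1080 Ti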

niters = 1000

print("Torch version: ", torch.__version__)
print("Torch CUDA version: ", torch.version.cuda)
print("CUDNN Version: ", torch.backends.cudnn.version())
print(torch.cuda.get_device_name(int(device[-1])))

def profile(model, x, benchmark, deterministic, nb_iters):
    torch.backends.cudnn.benchmark = benchmark
    torch.backends.cudnn.deterministic = deterministic

    # warmup
    for _ in range(10):
        out = model(x)

    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(nb_iters):
        out = model(x)
    torch.cuda.synchronize()
    t1 = time.time()

    return (t1 - t0) / nb_iters


model1 = nn.Sequential(
    nn.Conv1d(24, 256, kernel_size=(12,), stride=(6,), groups=4),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=(6,), stride=(3,), padding=(2,), groups=4),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,), groups=4),
    nn.ReLU(),
)

model1.to(device=device)

x = torch.randn(64, 24, 224, device=device)

time0 = profile(model1, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv1d model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model1, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv1d model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model1, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv1d model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model1, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv1d model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))

model2 = nn.Sequential(
    nn.Conv2d(8, 32, kernel_size=(8, 8), stride=(4, 4)),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
    nn.ReLU()
)
model2.to(device=device)

x = torch.randn(64, 8, 224, 224, device=device)
time0 = profile(model2, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv2d model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model2, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv2d model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model2, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv2d model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model2, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv2d model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))

model3 = nn.Sequential(
    nn.Conv2d(8, 32, kernel_size=(8, 1), stride=(4, 1)),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=(4, 1), stride=(2, 1), padding=(1, 1)),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(3, 1), stride=(1, 1), padding=(1, 1)),
    nn.ReLU()
)
model3.to(device=device)

x = torch.randn(64, 8, 224, 224, device=device)
time0 = profile(model3, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv2d1 model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model3, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv2d1 model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model3, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv2d1 model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model3, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv2d1 model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))

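The results are shown below. For 2D convolution the 3090 is nearly twice as fast as the 1080 Ti, while for 1D convolution the gain is small. The PyTorch version also has some effect on performance: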

Torch version:  1.8.0.dev20210106+cu110
Torch CUDA version:  11.0
CUDNN Version:  8005
GeForce RTX 3090
Conv1d model, benchmark=False, deterministic=False, 0.687ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.511ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.540ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.484ms/iter
Conv2d model, benchmark=False, deterministic=False, 1.327ms/iter
Conv2d model, benchmark=True, deterministic=False, 1.335ms/iter
Conv2d model, benchmark=False, deterministic=True, 1.474ms/iter
Conv2d model, benchmark=True, deterministic=True, 1.480ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 3.278ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 3.280ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 3.286ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 3.286ms/iter

Torch version:  1.8.0.dev20210106+cu110
Torch CUDA version:  11.0
CUDNN Version:  8005
GeForce GTX 1080 Ti
Conv1d model, benchmark=False, deterministic=False, 0.709ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.711ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.844ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.711ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.684ms/iter
Conv2d model, benchmark=True, deterministic=False, 2.883ms/iter
Conv2d model, benchmark=False, deterministic=True, 2.212ms/iter
Conv2d model, benchmark=True, deterministic=True, 2.195ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 5.583ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 6.077ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 6.097ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 6.120ms/iter

Torch version:  1.6.0+cu101
Torch CUDA version:  10.1
CUDNN Version:  7603
GeForce GTX 1080 Ti
Conv1d model, benchmark=False, deterministic=False, 0.544ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.542ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.544ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.542ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.149ms/iter
Conv2d model, benchmark=True, deterministic=False, 2.332ms/iter
Conv2d model, benchmark=False, deterministic=True, 2.469ms/iter
Conv2d model, benchmark=True, deterministic=True, 2.483ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 6.097ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 6.637ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 6.658ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 6.679ms/iter
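
The speedups can be computed directly from these logs. The following sketch parses the timing lines and divides the 1080 Ti time by the 3090 time for each configuration. The helper names parse_timings and speedup are illustrative, and only the benchmark=False, deterministic=False lines of the two PyTorch 1.8 runs above are included to keep it short.

```python
import re

LINE = re.compile(
    r"^(\w+) model, benchmark=(True|False), deterministic=(True|False), ([\d.]+)ms/iter$"
)

# Lines copied from the PyTorch 1.8 results above.
LOG_3090 = """\
Conv1d model, benchmark=False, deterministic=False, 0.687ms/iter
Conv2d model, benchmark=False, deterministic=False, 1.327ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 3.278ms/iter
"""
LOG_1080TI = """\
Conv1d model, benchmark=False, deterministic=False, 0.709ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.684ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 5.583ms/iter
"""


def parse_timings(log):
    """Map (model, benchmark, deterministic) to milliseconds per iteration."""
    timings = {}
    for line in log.splitlines():
        m = LINE.match(line.strip())
        if m:
            key = (m.group(1), m.group(2) == "True", m.group(3) == "True")
            timings[key] = float(m.group(4))
    return timings


def speedup(baseline, candidate):
    """Return baseline time divided by candidate time for shared keys."""
    return {k: baseline[k] / candidate[k] for k in baseline if k in candidate}


ratios = speedup(parse_timings(LOG_1080TI), parse_timings(LOG_3090))
for key, value in sorted(ratios.items()):
    print(key[0], "speedup: {:.2f}x".format(value))
```

For these lines the ratios come out at roughly 1.03x for Conv1d, 2.02x for Conv2d and 1.70x for Conv2d1, which matches the summary given before the logs.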

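MNIST classification

This test trains a convolutional neural network, evaluates it on the test set, and measures the total time for training and testing. The test code is as follows: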

from __future__ import print_function
import argparse
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR

# Settings under test; uncomment an alternative to change a setting.
device = 'cuda:0'      # GeForce RTX 3090
#device = 'cuda:1'     # GeForce GTX 1080 Ti
#num_workers = 1
num_workers = 4
batch_size = 64
epochs = 10
#benchmark = True
benchmark = False
deterministic = True
#deterministic = False
cudaTF32 = True
#cudaTF32 = False
cudnnTF32 = True
#cudnnTF32 = False

print("Torch Version: ", torch.__version__)
print("Torch CUDA Version: ", torch.version.cuda)
print("CUDNN Version: ", torch.backends.cudnn.version())
print("GPU Device: ", torch.cuda.get_device_name(int(device[-1])))
print("CUDNN Benchmark: ", benchmark)
print("CUDNN Deterministic: ", deterministic)
print("CUDA TF32: ", cudaTF32)
print("CUDNN TF32: ", cudnnTF32)
print("Workers: ", num_workers)
print("Batch Size: ", batch_size)
print("Epochs: ", epochs)

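torch.backends.cudnn.benchmark = benchmark
torch.backends.cudnn.deterministic = deterministic
# The TF32 switches exist only in PyTorch 1.7 and later. They are left commented
# out, so the cudaTF32/cudnnTF32 values printed above are not actually applied
# and the library defaults stay in effect.
#torch.backends.cuda.matmul.allow_tf32 = cudaTF32
#torch.backends.cudnn.allow_tf32 = cudnnTF32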

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    train_loss = 0.
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    train_loss /= len(train_loader.dataset)
    return train_loss
    

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    return test_loss

def main():
    global device, num_workers, batch_size, epochs
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=3, metavar='N',
                        help='number of epochs to train (default: 3)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=2020, metavar='S',
                        help='random seed (default: 2020)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device(device if use_cuda else "cpu")
    args.batch_size = batch_size
    args.epochs = epochs

    kwargs = {'batch_size': args.batch_size}
    if use_cuda:
        kwargs.update({'num_workers': num_workers,
                       'pin_memory': True,
                       'shuffle': True},
                     )

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    dataset1 = datasets.MNIST('../data', train=True, download=True,
                       transform=transform)
    dataset2 = datasets.MNIST('../data', train=False,
                       transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1,**kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    tstart = time.time()
    train_loss, test_loss = 0., 0.
    for epoch in range(1, args.epochs + 1):
        train_loss += train(args, model, device, train_loader, optimizer, epoch)
        test_loss += test(model, device, test_loader)
        scheduler.step()
    tend = time.time()
    train_loss /= args.epochs
    test_loss /= args.epochs

    print("Training Loss: ", train_loss)
    print("Testing  Loss: ", test_loss)
    print("Time: %.4f" % (tend - tstart))
    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()

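The results are shown below. The 3090 performs about the same as the 1080 Ti, and is in fact slightly slower, which is far from the performance advertised officially. On a more complex network model of my own, the 3090 did even worse.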

Torch Version:  1.8.0.dev20210106+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDNN Benchmark:  False
CUDNN Deterministic:  True
CUDA TF32:  True
CUDNN TF32:  True
Workers:  4
Batch Size:  64
Epochs:  10
Training Loss:  0.0009187350056997093
Testing  Loss:  0.03026894662413977
Time: 63.0998



Torch Version:  1.8.0.dev20210106+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDNN Benchmark:  False
CUDNN Deterministic:  True
CUDA TF32:  True
CUDNN TF32:  True
Workers:  4
Batch Size:  64
Epochs:  10
Training Loss:  0.0008879667648342487
Testing  Loss:  0.030606915746741615
Time: 56.9057

Torch Version:  1.6.0+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDNN Benchmark:  False
CUDNN Deterministic:  True
Workers:  4
Batch Size:  64
Epochs:  10
Training Loss:  0.0009102054347947707
Testing  Loss:  0.029809928882313725
Time: 52.8144
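
One hypothesis worth checking, though nothing measured in this post establishes it, is that with a network this small and a batch size of 64 the run is limited by data loading rather than by GPU compute. The sketch below separates the time spent waiting for the next batch from the time spent processing it. The function name split_epoch_time and the clock parameter are illustrative assumptions; in a real PyTorch run, step would need to end with torch.cuda.synchronize() so that asynchronous kernels are included in the measurement.

```python
import time


def split_epoch_time(batches, step, clock=time.perf_counter):
    """Return (seconds waiting for batches, seconds spent inside step)."""
    wait = work = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)          # time here is data-loading wait
        except StopIteration:
            break
        t1 = clock()
        step(batch)                   # time here is compute
        t2 = clock()
        wait += t1 - t0
        work += t2 - t1
    return wait, work


# Toy demonstration with a deterministic fake clock that advances by 1 per call.
ticks = iter(range(100))
wait, work = split_epoch_time([1, 2, 3], lambda b: None,
                              clock=lambda: float(next(ticks)))
print("wait:", wait, "work:", work)
```

If the wait share dominates on both cards, a faster GPU cannot shorten the run, and a larger batch size or more loader workers would be the next thing to vary.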

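The installed PyTorch is a binary build for CUDA 11.0, whereas the CUDA version installed in this post is 11.1. I was concerned that this mismatch might hold back the 3090, so I reinstalled CUDA 11.0 and repeated the test. The results are as follows; the CUDA version makes no difference.

[Figure: RTX 3090, PyTorch 1.8, MNIST, 3 runs]

[Figure: GTX 1080 Ti, PyTorch 1.8, MNIST, 3 runs]

Test results on TensorFlow

CIFAR image classification

The TensorFlow version is 2.4.0, with CUDA 11.0 and cuDNN 8005. The dataset is CIFAR-10. The test code is as follows:

Main file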

import os
import time
import tensorflow as tf

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays, 
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.summary()

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

tstart = time.time()
history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))
tend = time.time()
print("Training time: ", tend - tstart)

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')

tstart = time.time()
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
tend = time.time()
print("Testing time: ", tend - tstart)

print(test_acc)

The results are shown below. Under TensorFlow the 3090 is also unremarkable, performing about the same as the 1080 Ti:

2021-01-09 22:49:35.037113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22113 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)

2021-01-09 22:49:36.122102: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-09 22:49:36.122543: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2298505000 Hz
Epoch 1/10
2021-01-09 22:49:36.553517: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 22:49:37.349985: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 22:49:37.353723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-09 22:49:39.345531: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

1563/1563 [==============================] - 13s 6ms/step - loss: 1.7480 - accuracy: 0.3552 - val_loss: 1.2966 - val_accuracy: 0.5368
Epoch 2/10
1563/1563 [==============================] - 16s 11ms/step - loss: 1.1864 - accuracy: 0.5776 - val_loss: 1.1261 - val_accuracy: 0.6031
Epoch 3/10
1563/1563 [==============================] - 9s 6ms/step - loss: 1.0160 - accuracy: 0.6462 - val_loss: 0.9643 - val_accuracy: 0.6648
Epoch 4/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8975 - accuracy: 0.6860 - val_loss: 0.9399 - val_accuracy: 0.6661
Epoch 5/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8137 - accuracy: 0.7145 - val_loss: 0.9458 - val_accuracy: 0.6683
Epoch 6/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.7547 - accuracy: 0.7366 - val_loss: 0.8510 - val_accuracy: 0.7013
Epoch 7/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6963 - accuracy: 0.7557 - val_loss: 0.8670 - val_accuracy: 0.7034
Epoch 8/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6321 - accuracy: 0.7779 - val_loss: 0.8671 - val_accuracy: 0.7068
Epoch 9/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6121 - accuracy: 0.7854 - val_loss: 0.8556 - val_accuracy: 0.7122
Epoch 10/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.5720 - accuracy: 0.7980 - val_loss: 0.8800 - val_accuracy: 0.7110
Training time:  100.08890771865845
313/313 - 1s - loss: 0.8800 - accuracy: 0.7110
Testing time:  0.9198315143585205
0.7110000252723694


-----------------------------------------------------------------------


2021-01-09 22:53:03.004101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10269 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)

2021-01-09 22:53:03.788911: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-09 22:53:03.789360: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2298505000 Hz
Epoch 1/10
2021-01-09 22:53:04.196189: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 22:53:04.408018: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 22:53:04.410362: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
1563/1563 [==============================] - 12s 6ms/step - loss: 1.7363 - accuracy: 0.3623 - val_loss: 1.3144 - val_accuracy: 0.5389
Epoch 2/10
1563/1563 [==============================] - 10s 6ms/step - loss: 1.1831 - accuracy: 0.5816 - val_loss: 1.0454 - val_accuracy: 0.6328
Epoch 3/10
1563/1563 [==============================] - 9s 6ms/step - loss: 1.0309 - accuracy: 0.6423 - val_loss: 0.9749 - val_accuracy: 0.6596
Epoch 4/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.9134 - accuracy: 0.6766 - val_loss: 0.9642 - val_accuracy: 0.6651
Epoch 5/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8405 - accuracy: 0.7080 - val_loss: 0.9484 - val_accuracy: 0.6706
Epoch 6/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.7702 - accuracy: 0.7287 - val_loss: 0.8654 - val_accuracy: 0.7013
Epoch 7/10
1563/1563 [==============================] - 10s 6ms/step - loss: 0.7279 - accuracy: 0.7445 - val_loss: 0.8597 - val_accuracy: 0.7013
Epoch 8/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6909 - accuracy: 0.7604 - val_loss: 0.9126 - val_accuracy: 0.6914
Epoch 9/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6469 - accuracy: 0.7717 - val_loss: 0.9200 - val_accuracy: 0.6951
Epoch 10/10
1563/1563 [==============================] - 10s 6ms/step - loss: 0.6022 - accuracy: 0.7892 - val_loss: 0.8853 - val_accuracy: 0.7042
Training time:  97.05110216140747
313/313 - 1s - loss: 0.8853 - accuracy: 0.7042
Testing time:  1.0514824390411377
0.704200029373169

Things to note

Tensor Float32

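According to the "TF32 on Ampere" section of the PyTorch documentation, RTX 30 series (Ampere) GPUs support TensorFloat-32 (TF32) computation. When it is enabled, matrix multiplications and convolutions run on the Tensor Cores, which is faster than ordinary FP32 arithmetic but less precise. In the PyTorch versions tested in this post both flags default to True; note that PyTorch 1.12 and later default the matmul flag to False.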

# The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True
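
The precision cost of TF32 comes from its shorter mantissa: 10 bits instead of the 23 bits of FP32. The sketch below is a pure-Python illustration, not how PyTorch or the GPU implements it. The helper names `to_tf32` and `dot` are hypothetical, and `to_tf32` truncates the float32 bit pattern, whereas real hardware may round instead.

```python
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 by keeping only the top 10 mantissa bits of a float32.

    Simplification (assumption): truncate toward zero instead of rounding.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 of the 23 float32 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, cast=float):
    """Dot product whose inputs pass through `cast`; the sum uses Python floats."""
    return sum(cast(x) * cast(y) for x, y in zip(a, b))


a = [1.0 + i * 1e-4 for i in range(1000)]
b = [1.0 - i * 1e-4 for i in range(1000)]

exact = dot(a, b)
approx = dot(a, b, cast=to_tf32)
rel_err = abs(approx - exact) / abs(exact)
print("relative error with truncated inputs:", rel_err)
```

Only the inputs are truncated here while the accumulation keeps full precision, which is the general idea behind TF32 matrix multiplication; the exact error on a GPU depends on the data and on the hardware rounding mode.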

A strange phenomenon

Base environment:

OS: Ubuntu 16.04
CUDA: 8.0, 9.0, 10.1 and 11.1 installed side by side

First PyTorch configuration:

System Python: PyTorch 1.6.0+cu101
Anaconda: PyTorch 1.8.0.dev20210106+cu110

Second PyTorch configuration:

System Python: PyTorch 1.9.0.dev20210208+cu110 + Python 3.6.11
Anaconda: PyTorch 1.8.0.dev20210208+cu110 + Python 3.7.3

Third PyTorch configuration:

System Python: PyTorch 1.8.0.dev20210208+cu110 + Python 3.6.11
Anaconda: PyTorch 1.8.0.dev20210208+cu110 + Python 3.7.3

In the end it turned out that whenever the PyTorch version in the system Python differed from the one in Anaconda, any program run on the GPU would hang; the hang disappeared only when the two versions matched.
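
To compare the two environments quickly, the interpreter path and the installed PyTorch version can be printed from each of them. This is a minimal sketch and does not explain the hang itself. The helper name `installed_version` is hypothetical, and `importlib.metadata` requires Python 3.8 or newer, so on the Python 3.6/3.7 interpreters listed above one would print `torch.__version__` after `import torch` instead.

```python
import sys
from importlib import metadata


def installed_version(package="torch"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Run this once with the system Python and once inside the Anaconda environment.
print("interpreter:", sys.executable)
print("torch:", installed_version("torch"))
```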

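Performance of different PyTorch versions on different devices

torch.manual_seed(seed): sets the random seed so that every run starts from the same initial network weights.
torch.backends.cudnn.benchmark: whether cuDNN may search for the fastest convolution implementation. The available implementations differ in speed and precision depending on the device and the convolution. If the network structure is static and the input size does not change, set it to True; otherwise set it to False, because the search itself would then take a lot of time.
torch.backends.cudnn.deterministic: whether cuDNN is restricted to deterministic algorithms. If True, only deterministic algorithms are used. This setting affects both speed and precision.
torch.backends.cuda.matmul.allow_tf32: whether TensorFloat-32 (TF32) Tensor Cores may be used for matrix multiplication. True increases speed at some cost in precision. It requires an Ampere-architecture (or newer) GPU; on unsupported GPUs the setting has no effect.
torch.backends.cudnn.allow_tf32: whether cuDNN may use TensorFloat-32 (TF32) Tensor Cores. True increases speed at some cost in precision. It requires an Ampere-architecture (or newer) GPU; on unsupported GPUs the setting has no effect.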
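
The switches described above could be applied in one place before the model is built. The following is a minimal sketch, not the script used for the measurements in this post: the function name `configure` and its parameters are hypothetical, and the `try`/`except` around the import only exists so that the sketch also runs as a dry run on a machine without PyTorch.

```python
try:
    import torch  # optional here, so the sketch can run as a dry run
except ImportError:
    torch = None


def configure(seed=0, benchmark=False, deterministic=False, allow_tf32=False):
    """Seed PyTorch and set the cuDNN/TF32 switches; return the requested settings."""
    if torch is not None:
        torch.manual_seed(seed)
        torch.backends.cudnn.benchmark = benchmark
        torch.backends.cudnn.deterministic = deterministic
        torch.backends.cuda.matmul.allow_tf32 = allow_tf32
        torch.backends.cudnn.allow_tf32 = allow_tf32
    return {
        "CUDA TF32": allow_tf32,
        "CUDNN TF32": allow_tf32,
        "CUDNN Benchmark": benchmark,
        "CUDNN Deterministic": deterministic,
    }


settings = configure(seed=2021, benchmark=True, deterministic=False, allow_tf32=False)
for name, value in settings.items():
    print(f"{name}:  {value}")
```

Setting both TF32 flags from a single argument mirrors the way the tests in this post always toggle them together; they can of course be controlled independently.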

See "pytorch, nvidia ampere tensor cores: speed vs precision"; the precision loss is considerable.

For ease of comparison, the timing results are summarized first:


PyTorch 1.8.0.dev20210208+cu110 + RTX 3090+CUDA11

1. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
2. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True : train: 55s, valid: 14s
3. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
4. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
5. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
6. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True : train: 54s, valid: 14s
7. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
8. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True : train: 11s, valid: 8s

PyTorch 1.8.0.dev20210208+cu110 + GTX 1080TI+CUDA11

1. Benchmark=False, Deterministic=False: train: 21s, valid: 20s
2. Benchmark=False, Deterministic=True: train: 61s, valid: 20s
3. Benchmark=True, Deterministic=False: train: 18s, valid: 18s
4. Benchmark=True, Deterministic=True: train: 25s, valid: 18s


PyTorch 1.8.0.dev20210210+cu101 + GTX 1080TI+CUDA10

1. Benchmark=False, Deterministic=False: train: 29s, valid: 20s
2. Benchmark=False, Deterministic=True: train: 24s, valid: 20s
3. Benchmark=True, Deterministic=False: train: 17s, valid: 19s
4. Benchmark=True, Deterministic=True: train: 21s, valid: 19s

3090+CUDA11 results

The test configurations are as follows, each run twice:

CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False
CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True
CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False
CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True
CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False
CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True
CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False
CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True
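
The eight configurations in the list above are simply every combination of three on/off switches, with the two TF32 flags toggled together. As a small sketch (the names `benchmark_configs` and `describe` are hypothetical), they can be generated rather than typed by hand:

```python
from itertools import product


def benchmark_configs():
    """Return all (tf32, benchmark, deterministic) combinations in list order."""
    return list(product([False, True], repeat=3))


def describe(tf32, benchmark, deterministic):
    """Format one configuration the way it is written in this post."""
    return (f"CUDATF32={tf32}, CUDNNTF32={tf32}, "
            f"Benchmark={benchmark}, Deterministic={deterministic}")


configs = benchmark_configs()
for cfg in configs:
    print(describe(*cfg))
```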

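The results lead to the following conclusions:

Regardless of the Benchmark and Deterministic settings, TF32 gives essentially no speedup on this workload, while the metrics degrade slightly because of the precision loss.
Benchmark gives a large speedup, roughly 5x in training time (54 s down to 10 s per epoch), whatever the values of CUDATF32, CUDNNTF32 and Deterministic.
Deterministic changes the time very little (about 1 s per training epoch), whatever the values of CUDATF32, CUDNNTF32 and Benchmark.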
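
These conclusions can be checked with a few lines of arithmetic. The sketch below uses the rounded RTX 3090 per-epoch times listed in this post; the dictionary layout and the helper name `train_speedup` are illustrative assumptions.

```python
# Rounded (train, valid) seconds per epoch on the RTX 3090, copied from this post.
# Key: (tf32, benchmark, deterministic)
times = {
    (False, False, False): (54, 14),
    (False, False, True): (55, 14),
    (False, True, False): (10, 8),
    (False, True, True): (11, 8),
    (True, False, False): (54, 14),
    (True, False, True): (54, 14),
    (True, True, False): (10, 8),
    (True, True, True): (11, 8),
}


def train_speedup(baseline, candidate):
    """Ratio of training times; values above 1 mean `candidate` is faster."""
    return times[baseline][0] / times[candidate][0]


base = (False, False, False)
benchmark_gain = train_speedup(base, (False, True, False))
tf32_gain = train_speedup(base, (True, False, False))
deterministic_gain = train_speedup(base, (False, False, True))

print(f"Benchmark:     {benchmark_gain:.2f}x")
print(f"TF32:          {tf32_gain:.2f}x")
print(f"Deterministic: {deterministic_gain:.2f}x")
```

Because the inputs are rounded to whole seconds, the ratios are only indicative.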

The timing results are:

CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True : train: 55s, valid: 14s
CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True : train: 54s, valid: 14s
CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
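
Rounded times like the ones in this list can be derived from the raw per-epoch log lines. The following is a minimal sketch assuming the `--->Train epoch: ..., time: ...s` line format printed by the test script in this post; the helper name `mean_epoch_times` is hypothetical, and the sample is one logged run with Benchmark=True on the RTX 3090.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches the phase, the epoch number and the trailing "time: <seconds>s" field.
LINE_RE = re.compile(r"--->(Train|Valid) epoch: (\d+),.*?time: ([\d.]+)s")


def mean_epoch_times(log_text):
    """Return the mean seconds per epoch for each phase found in `log_text`."""
    buckets = defaultdict(list)
    for phase, _epoch, seconds in LINE_RE.findall(log_text):
        buckets[phase].append(float(seconds))
    return {phase: mean(values) for phase, values in buckets.items()}


sample = """\
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0174, contrast: -3.7344, time: 9.696520328521729s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6324, time: 8.027890682220459s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7427, time: 9.307999849319458s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6288, time: 7.912508487701416s
"""

result = mean_epoch_times(sample)
print({phase: round(seconds, 1) for phase, seconds in result.items()})
```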

Detailed results:

CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7454, time: 51.87094736099243s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6298, time: 14.616422176361084s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7351, time: 53.827654123306274s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6338, time: 14.419052362442017s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0174, contrast: -3.7504, time: 54.31160569190979s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6331, time: 14.683520078659058s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7414, time: 54.10149121284485s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6291, time: 14.825896978378296s

CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7438, time: 55.177640199661255s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6328, time: 14.768057823181152s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7462, time: 55.64263844490051s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6270, time: 14.934310674667358s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7438, time: 54.337199211120605s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6328, time: 14.743611574172974s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7462, time: 55.72549605369568s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6270, time: 14.936981439590454s

CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False (2 runs)

GeForce RTX 3090
Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0174, contrast: -3.7344, time: 9.696520328521729s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6324, time: 8.027890682220459s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7427, time: 9.307999849319458s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6288, time: 7.912508487701416s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0175, contrast: -3.7390, time: 10.154499053955078s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6319, time: 7.96985650062561s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0185, contrast: -3.7424, time: 10.21190619468689s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6280, time: 8.24509072303772s

CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7376, time: 10.749010562896729s
--->Valid epoch: 1, loss: 10.0892, entropy: 10.0892, l1norm: 10.0169, contrast: -3.6317, time: 8.022538900375366s
--->Train epoch: 2, loss: 10.0845, entropy: 10.0845, l1norm: 10.0184, contrast: -3.7403, time: 11.13167119026184s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6277, time: 8.124176025390625s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7376, time: 11.482399225234985s
--->Valid epoch: 1, loss: 10.0892, entropy: 10.0892, l1norm: 10.0169, contrast: -3.6317, time: 8.082707166671753s
--->Train epoch: 2, loss: 10.0845, entropy: 10.0845, l1norm: 10.0184, contrast: -3.7403, time: 11.133780717849731s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6277, time: 7.9467527866363525s

CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0174, contrast: -3.7520, time: 53.30528664588928s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6306, time: 14.644577264785767s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7400, time: 53.856117486953735s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6295, time: 14.902146816253662s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7454, time: 52.86871576309204s
--->Valid epoch: 1, loss: 10.0921, entropy: 10.0921, l1norm: 10.0176, contrast: -3.6283, time: 14.82244610786438s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7385, time: 54.21399688720703s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6301, time: 14.981997728347778s

CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7479, time: 53.832154750823975s
--->Valid epoch: 1, loss: 10.0914, entropy: 10.0914, l1norm: 10.0175, contrast: -3.6294, time: 14.578207015991211s
--->Train epoch: 2, loss: 10.0850, entropy: 10.0850, l1norm: 10.0185, contrast: -3.7328, time: 55.1286780834198s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6291, time: 14.752575635910034s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7479, time: 54.217995166778564s
--->Valid epoch: 1, loss: 10.0914, entropy: 10.0914, l1norm: 10.0175, contrast: -3.6294, time: 14.726275444030762s
--->Train epoch: 2, loss: 10.0850, entropy: 10.0850, l1norm: 10.0185, contrast: -3.7328, time: 58.631773948669434s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6291, time: 15.65324854850769s

CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7399, time: 9.78132176399231s
--->Valid epoch: 1, loss: 10.0895, entropy: 10.0895, l1norm: 10.0170, contrast: -3.6324, time: 7.887274980545044s
--->Train epoch: 2, loss: 10.0848, entropy: 10.0848, l1norm: 10.0185, contrast: -3.7401, time: 9.359825134277344s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6280, time: 7.946653127670288s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0838, entropy: 10.0838, l1norm: 10.0174, contrast: -3.7392, time: 9.73360824584961s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6307, time: 7.910605192184448s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7455, time: 9.480495691299438s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6302, time: 8.091880798339844s

CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7435, time: 10.862586498260498s
--->Valid epoch: 1, loss: 10.0924, entropy: 10.0924, l1norm: 10.0177, contrast: -3.6264, time: 7.919918537139893s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7405, time: 10.470875024795532s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6289, time: 7.927625894546509s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce RTX 3090
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7435, time: 10.894693613052368s
--->Valid epoch: 1, loss: 10.0924, entropy: 10.0924, l1norm: 10.0177, contrast: -3.6264, time: 7.909468173980713s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7405, time: 10.432015419006348s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6289, time: 7.901139974594116s

1080TI+CUDA11

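First, check whether CUDA TF32 and CUDNN TF32 have any effect on the 1080 Ti. The results below show that they do not: with the other flags fixed, the losses are identical and the times agree to within a second whether TF32 is on or off.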

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.39651656150818s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.37838625907898s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 60.98785209655762s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.49517011642456s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  True
CUDNN TF32:  True
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.536439895629883s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.666481971740723s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.99992823600769s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.800203800201416s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.48216986656189s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.80778193473816s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.893065929412842s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.811289072036743s

Next, the effect of Benchmark and Deterministic is tested with the following configurations, each run twice:

Benchmark=False, Deterministic=False
Benchmark=False, Deterministic=True
Benchmark=True, Deterministic=False
Benchmark=True, Deterministic=True

Timing results:

Benchmark=False, Deterministic=False: train: 21s, valid: 20s
Benchmark=False, Deterministic=True: train: 61s, valid: 20s
Benchmark=True, Deterministic=False: train: 18s, valid: 18s
Benchmark=True, Deterministic=True: train: 25s, valid: 18s

Detailed results:

Benchmark=False, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7399, time: 20.35549545288086s
--->Valid epoch: 1, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6324, time: 18.916125535964966s
--->Train epoch: 2, loss: 10.0855, entropy: 10.0855, l1norm: 10.0186, contrast: -3.7225, time: 22.069878578186035s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6277, time: 20.496481895446777s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7452, time: 20.829981803894043s
--->Valid epoch: 1, loss: 10.0902, entropy: 10.0902, l1norm: 10.0172, contrast: -3.6316, time: 20.57017183303833s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0183, contrast: -3.7417, time: 22.458086252212524s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6297, time: 20.48271942138672s

Benchmark=False, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.37249255180359s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.48044991493225s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7361, time: 61.695250272750854s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6316, time: 20.484682321548462s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.24735903739929s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.471797704696655s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7361, time: 61.43708038330078s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6316, time: 20.503613471984863s

Benchmark=True, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0841, entropy: 10.0841, l1norm: 10.0176, contrast: -3.7429, time: 18.17175269126892s
--->Valid epoch: 1, loss: 10.0923, entropy: 10.0923, l1norm: 10.0177, contrast: -3.6254, time: 18.15757131576538s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7369, time: 18.792244911193848s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6295, time: 18.823336124420166s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7487, time: 18.5258526802063s
--->Valid epoch: 1, loss: 10.0930, entropy: 10.0930, l1norm: 10.0178, contrast: -3.6256, time: 18.816319704055786s
--->Train epoch: 2, loss: 10.0852, entropy: 10.0852, l1norm: 10.0185, contrast: -3.7307, time: 18.378268718719482s
--->Valid epoch: 2, loss: 10.0885, entropy: 10.0885, l1norm: 10.0168, contrast: -3.6320, time: 18.800978660583496s

Benchmark=True, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.72299075126648s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.807397603988647s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.79756498336792s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.80388379096985s

Torch Version:  1.8.0.dev20210208+cu110
Torch CUDA Version:  11.0
CUDNN Version:  8005
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.729687929153442s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.8100643157959s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.809940576553345s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.82235860824585s

1080TI+CUDA10

Benchmark=False, Deterministic=False (2 runs)
Benchmark=False, Deterministic=True (2 runs)
Benchmark=True, Deterministic=False (2 runs)
Benchmark=True, Deterministic=True (2 runs)

Timing results:

Benchmark=False, Deterministic=False: train: 29s, valid: 20s
Benchmark=False, Deterministic=True: train: 24s, valid: 20s
Benchmark=True, Deterministic=False: train: 17s, valid: 19s
Benchmark=True, Deterministic=True: train: 21s, valid: 19s

Detailed results:

Benchmark=False, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7381, time: 28.030856609344482s
--->Valid epoch: 1, loss: 10.0925, entropy: 10.0925, l1norm: 10.0177, contrast: -3.6266, time: 19.793615102767944s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7522, time: 29.9300274848938s
--->Valid epoch: 2, loss: 10.0899, entropy: 10.0899, l1norm: 10.0172, contrast: -3.6343, time: 20.407748699188232s


GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7364, time: 28.959442138671875s
--->Valid epoch: 1, loss: 10.0906, entropy: 10.0906, l1norm: 10.0173, contrast: -3.6312, time: 20.42161989212036s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7435, time: 29.956018924713135s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6298, time: 20.411144495010376s

Benchmark=False, Deterministic=True (2 runs)

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0173, contrast: -3.7488, time: 23.54142165184021s
--->Valid epoch: 1, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6269, time: 20.411819219589233s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0187, contrast: -3.7421, time: 24.92523694038391s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6279, time: 20.398447513580322s

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  False
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0173, contrast: -3.7488, time: 23.943784952163696s
--->Valid epoch: 1, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6269, time: 20.42457938194275s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0187, contrast: -3.7421, time: 25.094601154327393s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6279, time: 20.471126079559326s

Benchmark=True, Deterministic=False (2 runs)

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7485, time: 16.708702325820923s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6340, time: 19.010523319244385s
--->Train epoch: 2, loss: 10.0838, entropy: 10.0838, l1norm: 10.0183, contrast: -3.7436, time: 17.553938627243042s
--->Valid epoch: 2, loss: 10.0883, entropy: 10.0883, l1norm: 10.0167, contrast: -3.6318, time: 19.111060619354248s

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  False
--->Train epoch: 1, loss: 10.0840, entropy: 10.0840, l1norm: 10.0175, contrast: -3.7400, time: 16.517553091049194s
--->Valid epoch: 1, loss: 10.0916, entropy: 10.0916, l1norm: 10.0175, contrast: -3.6292, time: 18.997257232666016s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7379, time: 17.554461240768433s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6306, time: 19.144949674606323s

Benchmark=True, Deterministic=True, run twice

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0839, entropy: 10.0839, l1norm: 10.0175, contrast: -3.7339, time: 20.768550872802734s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6320, time: 19.122512578964233s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7527, time: 21.37337899208069s
--->Valid epoch: 2, loss: 10.0892, entropy: 10.0892, l1norm: 10.0170, contrast: -3.6347, time: 19.099023818969727s

Torch Version:  1.8.0.dev20210210+cu101
Torch CUDA Version:  10.1
CUDNN Version:  7603
GPU Device:  GeForce GTX 1080 Ti
CUDA TF32:  False
CUDNN TF32:  False
CUDNN Benchmark:  True
CUDNN Deterministic:  True
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0175, contrast: -3.7441, time: 21.45243787765503s
--->Valid epoch: 1, loss: 10.0929, entropy: 10.0929, l1norm: 10.0178, contrast: -3.6257, time: 19.00617289543152s
--->Train epoch: 2, loss: 10.0854, entropy: 10.0854, l1norm: 10.0185, contrast: -3.7285, time: 21.43665385246277s
--->Valid epoch: 2, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6309, time: 19.1219220161438s
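
To make the three GTX 1080 Ti runs with torch 1.8.0.dev20210210+cu101 easier to compare, the train-epoch times can be averaged per configuration. The sketch below uses only the times printed in the logs above, rounded to two decimals. The variable names are illustrative, and treating Benchmark=False, Deterministic=True as the baseline is an assumption made for this comparison.

```python
from statistics import mean

# Train-epoch times in seconds, copied from the logs above
# (two runs x two epochs per configuration, rounded to 2 decimals).
train_times = {
    "Benchmark=False, Deterministic=True": [23.54, 24.93, 23.94, 25.09],
    "Benchmark=True, Deterministic=False": [16.71, 17.55, 16.52, 17.55],
    "Benchmark=True, Deterministic=True": [20.77, 21.37, 21.45, 21.44],
}

# Mean train-epoch time for each configuration.
averages = {name: mean(times) for name, times in train_times.items()}

# Assumption: use the Benchmark=False configuration as the reference point.
baseline = averages["Benchmark=False, Deterministic=True"]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} s per train epoch, {baseline / avg:.2f}x relative to baseline")
```

With only four samples per configuration, and with the first epoch of each run possibly including warm-up cost, these averages are a rough indication rather than a rigorous benchmark.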
