mmsegmentation训练的过程中eval时报错,环境如下:
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.55
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.6
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.2
- Magma 2.5.2
TorchVision: 0.13.0+cu116
OpenCV: 4.6.0
MMCV: 1.5.3
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.5
MMSegmentation: 0.26.0+891448f
GitHub上看到有人说是torch版本太新导致的,遂安装torch1.11.0和对应的torchvision0.12.0,运行后发现报了新的错误:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11)
原来是mmcv-full版本又不对应了,下载与torch版本对应的mmcv-full重新安装即可,配置后的新环境为:
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.55
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.11.0+cu115
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.5
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-
TorchVision: 0.12.0+cu115
OpenCV: 4.6.0
MMCV: 1.6.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.5
MMSegmentation: 0.26.0+891448f
实验发现虽然机器上的是CUDA11.6,但是11.5版本的torch也能顺利运行,OK~