stylegan3: Setting up PyTorch plugin "bias_act_plugin"... Failed or Setting up PyTorch plugin "upfirdn2d_plugin"... Failed
- 1. Software environment ⚙️
- 2. Problem description 🔍
- 3. Solution 🐡
- 3.1. Install the full `CUDA toolkit`, `ninja`, and `GCC`
- 3.2. Link the `cuda` library files to the correct location
- 3.3. Clean up stale build output and rerun the code
- 4. Result preview 🤔
1. Software environment ⚙️

| Item | Version |
| --- | --- |
| OS | Windows 10 Education, 64-bit |
| Python | 3.7.13 |
| WSL 2 | Ubuntu 20.04 |
| PyTorch | 1.13.1 |
| CUDA | 11.7 |
| GPU | RTX 4090 + RTX 6000 |
2. Problem description 🔍
Today I wanted to try out stylegan3 on my newly added RTX 4090, but it kept throwing errors:
Setting up PyTorch plugin "bias_act_plugin"... Failed!
/home/jayce/project/code/Python/Git/MAT/torch_utils/ops/bias_act.py:50: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:
Traceback (most recent call last):
File "/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1906, in _run_ninja_build
env=env)
File "/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jayce/project/code/Python/Git/MAT/torch_utils/ops/bias_act.py", line 48, in _init
_plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "/home/jayce/project/code/Python/Git/MAT/torch_utils/custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1296, in load
keep_intermediates=keep_intermediates)
File "/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1518, in _jit_compile
is_standalone=is_standalone)
File "/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1626, in _write_ninja_file_and_build_library
error_prefix=f"Error building extension '{name}'")
File "/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'bias_act_plugin': [1/3] /home/jayce/anaconda3/envs/cuda118+py37/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include/TH -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include/THC -isystem /home/jayce/anaconda3/envs/cuda118+py37/include -isystem /home/jayce/anaconda3/envs/cuda118+py37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /home/jayce/project/code/Python/Git/MAT/torch_utils/ops/bias_act.cu -o bias_act.cuda.o
[2/3] c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include/TH -isystem /home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/include/THC -isystem /home/jayce/anaconda3/envs/cuda118+py37/include -isystem /home/jayce/anaconda3/envs/cuda118+py37/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/jayce/project/code/Python/Git/MAT/torch_utils/ops/bias_act.cpp -o bias_act.o
[3/3] c++ bias_act.o bias_act.cuda.o -shared -L/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/jayce/anaconda3/envs/cuda118+py37/lib64 -lcudart -o bias_act_plugin.so
FAILED: bias_act_plugin.so
c++ bias_act.o bias_act.cuda.o -shared -L/home/jayce/anaconda3/envs/cuda118+py37/lib/python3.7/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/jayce/anaconda3/envs/cuda118+py37/lib64 -lcudart -o bias_act_plugin.so
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
This is because the officially recommended PyTorch 1.7.1 cannot actually support the RTX 4090. The officially recommended environment is:
Python 3.7
PyTorch 1.7.1
Cuda 11.0
With that setup, the build fails with `ValueError: Unknown CUDA arch (8.9) or GPU not supported`:
Traceback (most recent call last):
File "/home/jayce/project/code/Python/Git/MAT/torch_utils/ops/bias_act.py", line 48, in _init
_plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "/home/jayce/project/code/Python/Git/MAT/torch_utils/custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "/home/jayce/anaconda3/envs/cuda11.0+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 980, in load
keep_intermediates=keep_intermediates)
File "/home/jayce/anaconda3/envs/cuda11.0+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1185, in _jit_compile
with_cuda=with_cuda)
File "/home/jayce/anaconda3/envs/cuda11.0+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1276, in _write_ninja_file_and_build_library
with_cuda=with_cuda)
File "/home/jayce/anaconda3/envs/cuda11.0+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1602, in _write_ninja_file_to_build_library
cuda_flags = common_cflags + COMMON_NVCC_FLAGS + _get_cuda_arch_flags()
File "/home/jayce/anaconda3/envs/cuda11.0+py37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1421, in _get_cuda_arch_flags
raise ValueError("Unknown CUDA arch ({}) or GPU not supported".format(arch))
ValueError: Unknown CUDA arch (8.9) or GPU not supported
As you can see, when JIT-compiling custom ops, this setup supports at most the Ampere architecture, i.e. the RTX 30 series; it does not support the 40-series Ada GPUs:
Although the CUDA kernels cannot be built, the code can still run and produce results through the slow fallback path; the speed is just disappointing, and the program will warn you:
UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation.
So we upgrade straight to the current stable PyTorch release, 1.13.1:
Python 3.7
PyTorch 1.13.1
Cuda 11.7
In torch/utils/cpp_extension.py you can see that the 40 series, as well as the even more powerful Hopper compute cards, are now supported:
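If you want to confirm this on your own machine, one quick check (a sketch; it assumes `python` resolves to the interpreter with PyTorch installed) is to ask PyTorch which CUDA architectures its build supports; on 1.13.1 with a CUDA-enabled install the list should include `sm_89` (Ada):

```shell
# Print the CUDA architectures the installed PyTorch build supports.
# On PyTorch 1.13.1 this should include sm_89 (Ada / RTX 40 series);
# the fallback message covers shells where torch isn't importable.
python -c "import torch; print(torch.cuda.get_arch_list())" 2>/dev/null \
  || echo "torch not importable in this shell"
```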
At this point, though, your system has more than one CUDA installation, and with multiple environments the build can pick up the wrong one, which is exactly how the error in the title appears!
So what do we do now?
3. Solution 🐡
3.1. Install the full CUDA toolkit, ninja, and GCC
Before tackling this problem, install a complete CUDA toolkit (I installed a standalone CUDA 11.7), because the CUDA runtime bundled with PyTorch does not meet the needs of compiling extra ops ourselves. For the difference between the two, see my earlier post on machines where PyTorch uses the GPU fine yet nvcc -V does nothing; the official docs give a hint about this too:
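A quick sanity check before moving on (a sketch; it only verifies the three build prerequisites are on your `PATH`):

```shell
# Check that the three build prerequisites are visible on PATH.
for tool in nvcc ninja gcc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: MISSING - install it before building CUDA extensions"
  fi
done
```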
3.2. Link the cuda library files to the correct location
Looking closely at the error, ninja failed while compiling the kernels:
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
So why didn't ninja fail to compile before the 4090 arrived? Most likely because your CUDA-related variables were unambiguous back then, so PyTorch resolved the right CUDA environment variables; the official docs mention this as well:
Now, however, the machine has two CUDA installations, the variables the build imports get mixed up, and it fails!
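When several toolkits coexist, one way to disambiguate (a sketch; the install path /usr/local/cuda-11.7 is an assumption about your setup) is to point the variables PyTorch's cpp_extension consults at the one toolkit you want:

```shell
# Point PyTorch's extension builder at one specific toolkit.
# cpp_extension checks CUDA_HOME first when locating nvcc and the CUDA libs.
export CUDA_HOME=/usr/local/cuda-11.7            # assumed install location
export PATH="$CUDA_HOME/bin:$PATH"               # so 'nvcc' resolves here
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
echo "CUDA_HOME=$CUDA_HOME"
```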
So look through the error for the CUDA-related variables, and you will find one failing line:
/usr/bin/ld: cannot find -lcudart
What is `-lcudart`?
Flags like this refer to CUDA-related library files. Besides `-lcudart`, you may also meet other `cannot find -lxxx` errors; the corresponding library file follows the naming rule lib + library name (i.e. xxx) + .so, so `-lcudart` corresponds to `libcudart.so`.
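The mapping is mechanical, as this tiny sketch shows:

```shell
# The linker flag -lNAME maps to a shared library file named libNAME.so.
flag="-lcudart"
name="${flag#-l}"        # strip the -l prefix -> "cudart"
echo "lib${name}.so"     # prints: libcudart.so
```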
So we use the locate command to find libcudart.so; note that you should pick the CUDA version you actually want, 11.7 in our case:
(WSL 2 users who have trouble installing locate — the mlocate database initialization hanging at 60% — can see my other post on that.)
Then simply link it to the correct location:
sudo ln -s /usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudart.so /usr/lib/libcudart.so
Actually, the command below is slightly better. Both links end up pointing at libcudart.so.11.0, so the effect is identical, but this one is more flexible: when you set the CUDA_HOME variable, /usr/local/cuda points at a specific cuda-x.y (11.7 on my machine), so we link /usr/local/cuda/lib64/libcudart.so instead:
sudo ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so
PS: ln is a very important Linux command. It creates a link to a file or directory at another location, similar to a shortcut on Windows. Its most commonly used option is -s, used as:
sudo ln -s <source-file> <link-location>
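To see what the fix above actually does, here is a throwaway sketch in a temp directory (the real paths under /usr/local/cuda are not touched):

```shell
# Demonstrate ln -s: the link name resolves to the file it points at.
tmp=$(mktemp -d)
echo "fake cuda runtime" > "$tmp/libcudart.so.11.0"   # stand-in for the real lib
ln -s "$tmp/libcudart.so.11.0" "$tmp/libcudart.so"    # ln -s <source> <link>
cat "$tmp/libcudart.so"                               # prints: fake cuda runtime
rm -rf "$tmp"
```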
3.3. Clean up stale build output and rerun the code
Simply delete the /home/jayce/.cache/torch_extensions folder, which contains the files generated by the earlier failed builds:
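The deletion itself is one command; here is a sketch against a throwaway directory standing in for the real cache (by default PyTorch puts JIT builds under ~/.cache/torch_extensions, overridable via TORCH_EXTENSIONS_DIR):

```shell
# Simulate clearing a stale extension cache in a throwaway location.
cache=$(mktemp -d)/torch_extensions
mkdir -p "$cache/bias_act_plugin"
touch "$cache/bias_act_plugin/bias_act.o"   # stale object from an old build
rm -rf "$cache"                             # same command used on the real cache
[ -d "$cache" ] && echo "still there" || echo "cache cleared"
```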
Then run the code again!
4. Result preview 🤔
Running the code again, there are no more errors:
Setting up PyTorch plugin "bias_act_plugin"... Failed
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed
has become:
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
You've read this far — why not drop a like, a comment, and a favorite?