TL;DR:根据您提供的代码,在您的特定用法的两种情况下,似乎cudaSetDevice()
正在替换堆栈顶部的上下文。
让我们稍微修改一下您的代码,然后看看我们可以推断出代码中每个 API 调用对上下文堆栈的影响:
$ cat t1759.cu
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <iostream>
void check(int j, CUcontext ctx1, CUcontext ctx2){
CUcontext ctx0;
int i = 0;
while (true) {
auto status = cuCtxPopCurrent(&ctx0);
if (status != CUDA_SUCCESS) { break; }
if (ctx0 == ctx1) std::cout << j << ":Next context on stack (" << i++ << ") is ctx1:" << (void*) ctx0 << '\n';
else if (ctx0 == ctx2) std::cout << j << ":Next context on stack (" << i++ << ") is ctx2:" << (void*) ctx0 << '\n';
else std::cout << j << ":Next context on stack (" << i++ << ") is unknown:" << (void*) ctx0 << '\n';
}
}
void runtest(int i)
{
CUcontext ctx1, primary = NULL;
cuInit(0);
auto dstatus = cuCtxCreate(&ctx1, 0, 0); // checkpoint 1
assert (dstatus == CUDA_SUCCESS);
if (i == 1) {check(i,ctx1,primary); return;}// checkpoint 1
dstatus = cuCtxPushCurrent(ctx1); // checkpoint 2
assert (dstatus == CUDA_SUCCESS);
if (i == 2) {check(i,ctx1,primary); return;}// checkpoint 2
auto rstatus = cudaSetDevice(0); // checkpoint 3
assert (rstatus == cudaSuccess);
if (i == 3) {check(i,ctx1,primary); return;}// checkpoint 3
void* ptr1;
void* ptr2;
rstatus = cudaMalloc(&ptr1, 1024); // checkpoint 4
assert (rstatus == cudaSuccess);
if (i == 4) {check(i,ctx1,primary); return;}// checkpoint 4
dstatus = cuCtxGetCurrent(&primary); // checkpoint 5
assert (dstatus == CUDA_SUCCESS);
assert(primary != ctx1);
if (i == 5) {check(i,ctx1,primary); return;}// checkpoint 5
dstatus = cuCtxPushCurrent(ctx1); // checkpoint 6
assert (dstatus == CUDA_SUCCESS);
if (i == 6) {check(i,ctx1,primary); return;}// checkpoint 6
rstatus = cudaMalloc(&ptr2, 1024); // checkpoint 7
assert (rstatus == cudaSuccess);
if (i == 7) {check(i,ctx1,primary); return;}// checkpoint 7
rstatus = cudaSetDevice(0); // checkpoint 8
assert (rstatus == cudaSuccess);
if (i == 8) {check(i,ctx1,primary); return;}// checkpoint 8
return;
}
int main(){
for (int i = 1; i < 9; i++){
cudaDeviceReset();
runtest(i);}
}
$ nvcc -o t1759 t1759.cu -lcuda -std=c++11
$ ./t1759
1:Next context on stack (0) is ctx1:0x11087e0
2:Next context on stack (0) is ctx1:0x1741160
2:Next context on stack (1) is ctx1:0x1741160
3:Next context on stack (0) is unknown:0x10dc520
3:Next context on stack (1) is ctx1:0x1c5aa70
4:Next context on stack (0) is unknown:0x10dc520
4:Next context on stack (1) is ctx1:0x23eaa00
5:Next context on stack (0) is ctx2:0x10dc520
5:Next context on stack (1) is ctx1:0x32caf30
6:Next context on stack (0) is ctx1:0x3a44ed0
6:Next context on stack (1) is ctx2:0x10dc520
6:Next context on stack (2) is ctx1:0x3a44ed0
7:Next context on stack (0) is ctx1:0x41cfd90
7:Next context on stack (1) is ctx2:0x10dc520
7:Next context on stack (2) is ctx1:0x41cfd90
8:Next context on stack (0) is ctx2:0x10dc520
8:Next context on stack (1) is ctx2:0x10dc520
8:Next context on stack (2) is ctx1:0x4959c70
$
根据上述内容,当我们继续执行代码中的每个 API 调用时:
1.
auto dstatus = cuCtxCreate(&ctx1, 0, 0); // checkpoint 1
1:Next context on stack (0) is ctx1:0x11087e0
如前所述,上下文创建还将新创建的上下文推送到堆栈上here https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX_1g9051f2d5c31501997a6cb0530290a300.
2.
dstatus = cuCtxPushCurrent(ctx1); // checkpoint 2
2:Next context on stack (0) is ctx1:0x1741160
2:Next context on stack (1) is ctx1:0x1741160
毫不奇怪,将相同的上下文推入堆栈会为其创建另一个堆栈条目。
3.
auto rstatus = cudaSetDevice(0); // checkpoint 3
3:Next context on stack (0) is unknown:0x10dc520
3:Next context on stack (1) is ctx1:0x1c5aa70
The cudaSetDevice()
通话有replaced具有“未知”上下文的堆栈顶部。 (目前尚不清楚,因为我们尚未检索“其他”上下文的句柄值)。
4.
rstatus = cudaMalloc(&ptr1, 1024); // checkpoint 4
4:Next context on stack (0) is unknown:0x10dc520
4:Next context on stack (1) is ctx1:0x23eaa00
由于此调用,堆栈配置没有差异。
5.
dstatus = cuCtxGetCurrent(&primary); // checkpoint 5
5:Next context on stack (0) is ctx2:0x10dc520
5:Next context on stack (1) is ctx1:0x32caf30
由于此调用,堆栈配置没有差异,但我们现在知道堆栈上下文的顶部是当前上下文(我们可以推测它是主上下文)。
6.
dstatus = cuCtxPushCurrent(ctx1); // checkpoint 6
6:Next context on stack (0) is ctx1:0x3a44ed0
6:Next context on stack (1) is ctx2:0x10dc520
6:Next context on stack (2) is ctx1:0x3a44ed0
这里没有真正的惊喜。我们正在推动ctx1
在堆栈上,因此堆栈有 3 个条目,第一个条目是驱动程序 API 创建的上下文,接下来的两个条目与步骤 5 中的堆栈配置相同,只是向下移动一个堆栈位置。
7.
rstatus = cudaMalloc(&ptr2, 1024); // checkpoint 7
7:Next context on stack (0) is ctx1:0x41cfd90
7:Next context on stack (1) is ctx2:0x10dc520
7:Next context on stack (2) is ctx1:0x41cfd90
同样,此调用对堆栈配置没有影响。
8.
rstatus = cudaSetDevice(0); // checkpoint 8
8:Next context on stack (0) is ctx2:0x10dc520
8:Next context on stack (1) is ctx2:0x10dc520
8:Next context on stack (2) is ctx1:0x4959c70
我们再次看到这里的行为是cudaSetDevice()
通话有replaced堆栈上下文的顶部与主上下文。
我从你的测试代码中得到的结论是我看到no的行为不一致cudaSetDevice()
与代码中的各种运行时和驱动程序 API 调用混合时调用。
从我的角度来看,这种编程范式是疯狂的。我无法想象为什么你会想以这种方式混合驱动程序 API 和运行时 API 代码。