使用非均匀节点优化 CUDA 内核插值

2024-03-22

原问题

我有以下内核使用非均匀节点执行插值，我想对其进行优化：

__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    int PP;
    double P;
    const double alfa=(2.-1./cc)*pi_double-0.01;
    double phi_cap_s;
    cufftDoubleComplex temp;

    double cc_points=cc*points[i];
    double r_cc_points=rint(cc*points[i]);

    temp = make_cuDoubleComplex(0.,0.);

    if(i<M) {   
        for(int m=0; m<(2*K+1); m++) {
            P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));

            if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
            if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
            if(P==0.) phi_cap_s = alfa/pi_double;        

            PP = modulo((r_cc_points + m -K ),(cc*N)); 
            temp.x = temp.x+phi_cap_s*Uj[PP].x; 
            temp.y = temp.y+phi_cap_s*Uj[PP].y; 
        } 

        result[i] = temp; 
    }
}

K 和 cc 是常量，points 包含节点，Uj 是要插值的值。 modulo 是一个基本上以 % 形式工作的函数，但可以适当扩展到负值。对于某种安排，内核调用需要 2.3ms。我已经证实最昂贵的零件是

            if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
            if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
            if(P==0.) phi_cap_s = alfa/pi_double;

大约占总时间的 40%，并且

        PP = modulo((r_cc_points + m -K ),(cc*N)); 
        temp.x = temp.x+phi_cap_s*Uj[PP].x; 
        temp.y = temp.y+phi_cap_s*Uj[PP].y;

大约占 60%。通过Visual Profiler，我已经验证了前者的性能不受存在的影响if陈述。请注意，我想要双精度，所以我避免使用 __exp() 解决方案。我怀疑，对于后者，“随机”内存访问 Uj[PP] 可能负责那么多的计算百分比。关于减少计算时间的技巧或评论有什么建议吗？提前致谢。

以下评论和答案的版本

根据答案和评论中提供的建议，我最终得到了以下代码：

__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    int PP;
    double P,tempd;
    const double alfa=(2.-1./cc)*pi_double-0.01;
    cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);

    double cc_points=cc*points[i];
    double r_cc_points=rint(cc_points);

    cufftDoubleComplex rtemp[(2*K+1)];
    double phi_cap_s[2*K+1];

    if(i<M) {   
     #pragma unroll //unroll the loop
     for(int m=0; m<(2*K+1); m++) {
         PP = modulo(((int)r_cc_points + m -K ),(cc*N)); 
            rtemp[m] = Uj[PP]; //2

         P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
         if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd);  }
         else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
         else phi_cap_s[m] = alfa/pi_double;  
     }

     #pragma unroll //unroll the loop
     for(int m=0; m<(2*K+1); m++) {
         temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; 
           temp.y = temp.y+phi_cap_s[m]*rtemp[m].y; 
     } 

     result[i] = temp; 
     }
 }

尤其： 1）我将全局内存变量Uj移动到大小为2*K+1的寄存器rtemp数组（在我的例子中K是等于6的常数）； 2）我将变量phi_cap_s移动到2*K+1大小的寄存器； 3）我使用了if ... else语句代替了之前使用的三个if（条件P0.具有相同的出现概率）； 3）我为平方根定义了额外的变量； 4）我使用rsqrt而不是sqrt（据我所知，sqrt()由CUDA计算为1/rsqrt()）；

我一次添加了每个新功能，验证了对原始版本的改进，但我必须说，它们都没有给我带来任何相关的改进。

执行速度受以下因素限制： 1）sin/sinh函数的计算（大约40%的时间）；有没有办法通过以某种方式利用内在数学作为“起始猜测”来以双精度算术计算它们？ 2) 由于映射索引 PP，许多线程最终访问相同的全局内存位置 Uj[PP]；避免这种情况的一种可能性是使用共享内存，但这意味着强大的线程合作。

我的问题是。我完成了吗？即，有什么办法可以改进代码吗？我使用 NVIDIA Visual Profiler 对代码进行了分析，结果如下：

IPC = 1.939 (compute capability 2.1);
Global Memory Load Efficiency = 38.9%;
Global Memory Store Efficiency = 18.8%;
Warp Execution Efficiency = 97%;
Instruction Replay Overhead = 0.7%;

最后，我想指出这个讨论与以下位置的讨论相关：CUDA：CUDA 中的一维三次样条插值 https://stackoverflow.com/questions/10859322/cuda-1-dimensional-cubic-spline-interpolation-in-cuda

使用共享内存的版本

我对使用共享内存进行了可行性研究。我考虑过N=64使得整个Uj适合共享内存。下面是代码（基本上是我的原始版本）

    __global__ void interpolation_shared(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
 {
         int i = threadIdx.x + blockDim.x * blockIdx.x;

     int PP;
     double P;
     const double alfa=(2.-1./cc)*pi_double-0.01;
     double phi_cap_s;
     cufftDoubleComplex temp;

     double cc_points=cc*points[i];
     double r_cc_points=rint(cc*points[i]);

     temp = make_cuDoubleComplex(0.,0.);

     __shared__ cufftDoubleComplex Uj_shared[128];

     if (threadIdx.x < cc*N) Uj_shared[threadIdx.x]=Uj[threadIdx.x];

     if(i<M) {  
         for(int m=0; m<(2*K+1); m++) {
         P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));

         if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
         if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));  
         if(P==0.) phi_cap_s = alfa/pi_double;        

         PP = modulo((r_cc_points + m -K ),(cc*N)); 
         temp.x = temp.x+phi_cap_s*Uj_shared[PP].x; 
         temp.y = temp.y+phi_cap_s*Uj_shared[PP].y; 
      } 

      result[i] = temp; 
    }
 }

结果再次没有显着改善，尽管这可能取决于输入数组的小尺寸。

详细 PTXAS 输出

ptxas : info : Compiling entry function '_Z13interpolationP7double2PdS0_ii' for 'sm_20'
ptxas : info : Function properties for _Z13interpolationP7double2PdS0_ii
  352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 55 registers, 456 bytes cumulative stack size, 52 bytes cmem[0]

第一次扭曲且 m=0 时的 P 值

 0.0124300933082964
 0.0127183892149176
 0.0135847002913749
 0.0161796378170038
 0.0155488126345702
 0.0138890822153499
 0.0121163187739057
 0.0119998374528905
 0.0131600831194518
 0.0109574866163769
 0.00962949548477354
 0.00695850974164358
 0.00446426651940612
 0.00423369284281705
 0.00632921297092537
 0.00655137618976198
 0.00810202954519923
 0.00597974034698723
 0.0076811348379735
 0.00604267951733561
 0.00402922460255439
 0.00111841719893846
 -0.00180949615796777
 -0.00246283218698551
 -0.00183256444286428
 -0.000462696661685413
 0.000725108980390132
 -0.00126793006072035
 0.00152263101649197
 0.0022499598348702
 0.00463681632275836
 0.00359856091027666

模函数

__device__ int modulo(int val, int modulus)
{
   if(val > 0) return val%modulus;
   else
   {
       int P = (-val)%modulus;
       if(P > 0) return modulus -P;
       else return 0;
   }
}

根据答案优化模函数

__device__ int modulo(int val, int _mod)
{
    if(val > 0) return val&(_mod-1);
    else
    {
        int P = (-val)&(_mod-1);
        if(P > 0) return _mod -P;
        else return 0;
    }
}

//your code above
cufftDoubleComplex rtemp[(2*K+1)] //if it fits into available registers, assumes K is a constant

if(i<M) {   
#pragma unroll //unroll the loop
    for(int m=0; m<(2*K+1); m++) {

        PP = modulo((r_cc_points + m -K ),(cc*N)); 
        rtemp[m] = Uj[PP]; //2
    }
#pragma unroll
    for(nt m=0; m<(2*K+1); m++) {
        P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
        // 1
        if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
        else if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
        else phi_cap_s = alfa/pi_double;  

        temp.x = temp.x+phi_cap_s*rtemp[m].x; //3
        temp.y = temp.y+phi_cap_s*rtemp[m].y; 
    }

    result[i] = temp; 
}

解释

添加 else if 和 else 因为这些条件是互斥的，如果可以的话，您应该在发生概率之后对语句进行排序。例如。如果P
这会将请求的内存获取到多个寄存器，您之前所做的操作肯定会由于没有及时可用的内存用于计算而导致该线程阻塞。请记住，如果一个线程在扭曲中阻塞，则整个扭曲都会被阻塞。如果就绪队列中没有足够的扭曲，程序将阻塞，直到任何扭曲准备就绪。
现在，我们将计算时间进一步向前推进，以补偿错误的内存访问，希望之前完成的计算能够补偿错误的访问模式。

这应该起作用的原因如下：

GMEM 中的内存请求大约 >~400-600 个时钟周期。如果线程尝试对当时不可用的内存执行操作，它将阻塞。这意味着，如果每个内存请求不在 L1-L2 中，则每个 warp 都必须等待该时间或更长时间，直到它可以继续。

我怀疑的是temp.x+phi_cap_s*Uj[PP].x就是这么做的。通过将每个内存传输暂存（步骤 2）到寄存器，然后继续进行下一个暂存，您将允许您在内存传输时执行其他工作，从而隐藏延迟。

当您到达步骤 3 时，内存有望可用，否则您必须等待更少的时间。

If rtemp不适合登记册以实现 100% 占用率，您可能必须分批进行。

你也可以尝试做phi_cap_s放入一个数组并将其放入第一个循环中，如下所示：

#pragma unroll //unroll the loop
    for(int m=0; m<(2*K+1); m++) {
        //stage memory first
        PP = modulo((r_cc_points + m -K ),(cc*N)); 
        rtemp[m] = Uj[PP]; //2

        P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
        // 1
        if(P>0.)  phi_cap_s[m] = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
        else if(P<0.)  phi_cap_s[m] = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
        else phi_cap_s[m] = alfa/pi_double; 

    }
#pragma unroll
    for(nt m=0; m<(2*K+1); m++) {
        temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; //3
        temp.y = temp.y+phi_cap_s[m]*rtemp[m].y; 
    }

Edit

表达

P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));

可以细分为：

const double cc_diff = cc_points-r_cc_points;
double exp = cc_diff - (double)(m-K);
exp *= exp;
P = (K*K-exp);

这可能会减少使用的指令数量。

Edit 2

__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    int PP;
    double P,tempd;


    cufftDoubleComplex rtemp[(2*K+1)];
    double phi_cap_s[2*K+1];

    if(i<M) {
         const double cc_points=cc*points[i];
         cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);

         const double alfa=(2.-1./cc)*pi_double-0.01;


         const double r_cc_points=rint(cc_points);
         const double cc_diff = cc_points-r_cc_points;

     #pragma unroll //unroll the loop
         for(int m=0; m<(2*K+1); m++) {
             PP = m-k; //reuse PP
             double exp = cc_diff - (double)(PP); //stage exp to be used later, will explain

             PP = modulo(((int)r_cc_points + PP ),(cc*N)); 
             rtemp[m] = Uj[PP]; //2


             exp *= exp;
             P = (K*K-exp);

             if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd);  }
             else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
             else phi_cap_s[m] = alfa/pi_double;  
         }

     #pragma unroll //unroll the loop
         for(int m=0; m<(2*K+1); m++) {
             temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; 
             temp.y = temp.y+phi_cap_s[m]*rtemp[m].y; 
         } 

     result[i] = temp; 
     }
 }

我所做的是将所有计算移入 if 语句内，以释放计算和内存获取方面的一些资源，不知道第一个 if 语句上的差异if(i<M). As m-K代码中出现了两次，我先放入PP计算时使用exp and PP.

您还可以做的就是尝试对指令进行排序，以便在设置变量时，在下一次使用该变量之间尽可能多地发出指令，因为将其设置到变量中大约需要 20 次。寄存器。因此，我将常量 cc_diff 放在顶部，但是，由于这只是一条指令，它可能不会显示任何好处。

模函数

__device__ modulo(int val, int _mod) {
    int p = (val&(_mod-1));// as modulo is always the power of 2
    if(val < 0) {
        return _mod - p;
    } else {
        return p;
    }
}

正如我们所拥有的_mod始终为 2 次幂的整数(cc = 2, N = 64, cc*N = 128），我们可以使用这个函数来代替 mod 运算符。这应该“快得多”。检查一下，这样我的算术就正确了。它来自第 14 页。

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

CUDA

interpolation