[CUDA] CUDA Quick Start (1): Basic Concepts and Hello World

2023-11-19

CUDA Basics

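1 Introduction to CUDA

CUDA stands for Compute Unified Device Architecture, a parallel computing platform and programming model developed by NVIDIA, the leading GPU vendor. Many people associate graphics cards with games and their realistic visuals. Behind the scenes, it is the GPU's strong capacity for graphics computation that lets a computer run large 3D games at high image quality and high frame rates.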

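2 Architectural differences between GPUs and CPUs

A CPU has the following characteristics:

Optimized for single-thread performance, so each thread runs fast
Good at complex control logic, branch prediction, and similar tasks
Has large, low-latency caches that reduce the average DRAM access time

Its architecture can be represented as in the figure below.
Figure: CPU architecture
A GPU, by contrast, has the following characteristics:

Cores are designed to execute a large number of parallel threads
Cores are optimized for data-parallel computation
Uses many additional threads to hide DRAM access latency

Its architecture is shown below:
Figure: GPU architecture
In addition, a GPU contains many streaming multiprocessors (SMs), each of which contains many CUDA cores.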

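3 Querying GPU hardware information

This article assumes the CUDA environment is already set up and does not cover installation; plenty of tutorials are available if you need one. All demonstrations use Linux.

With the environment in place, the next step is to look at the hardware information of the GPU we have, which also helps when choosing some parameters later. The following program prints it.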

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/**
 * @brief Print selected properties of one CUDA device.
 *
 * @param prop properties filled in by cudaGetDeviceProperties
 */
void showDeviceProp(const cudaDeviceProp &prop) {
    printf("Device name: %s\n", prop.name);
    printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("  Clock rate: %d\n", prop.clockRate);              // in kHz
    printf("  Memory clock rate: %d\n", prop.memoryClockRate); // in kHz
    printf("  Memory bus width: %d\n", prop.memoryBusWidth);   // in bits
    // Theoretical peak for double-data-rate memory:
    // 2 transfers per clock * memory clock (kHz) * bus width (bytes), scaled to GB/s
    printf("  Peak memory bandwidth: %.1f GB/s\n",
           2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
    printf("  Total global memory: %zu\n", prop.totalGlobalMem);
    printf("  Total shared memory per block: %zu\n", prop.sharedMemPerBlock);
    printf("  Total registers per block: %d\n", prop.regsPerBlock);
    printf("  Warp size: %d\n", prop.warpSize);
    printf("  Maximum memory pitch: %zu\n", prop.memPitch);
    printf("  Maximum threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("  Maximum dimension of block: %d x %d x %d\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("  Maximum dimension of grid: %d x %d x %d\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("  Total constant memory: %zu\n", prop.totalConstMem);
    printf("  Texture alignment: %zu\n", prop.textureAlignment);
    // deviceOverlap is deprecated in the runtime API in favor of asyncEngineCount
    printf("  Concurrent copy and execution: %s\n", prop.deviceOverlap ? "Yes" : "No");
    printf("  Number of multiprocessors: %d\n", prop.multiProcessorCount);
    printf("  Kernel execution timeout: %s\n", prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    printf("  Integrated GPU sharing Host Memory: %s\n", prop.integrated ? "Yes" : "No");
}

int main() {
    int num_devices = 0;
    cudaDeviceProp properties;
    // Fails if, for example, no driver or no CUDA-capable device is present
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    printf("%d CUDA devices found\n", num_devices);
    for (int i = 0; i < num_devices; i++) {
        err = cudaGetDeviceProperties(&properties, i);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: \"%s\"\n", i, properties.name);
        showDeviceProp(properties);
    }

    return 0;
}

Compile the program with nvcc device_query.cu -o device_query, then run ./device_query to get the hardware information of the local machine.

1 CUDA devices found
Device 0: "NVIDIA Tesla K40c"
Device name: NVIDIA Tesla K40c
  Compute capability: 3.5
  Clock rate: 745000
  Memory clock rate: 3004000
  Memory bus width: 384
  Peak memory bandwidth: 288.4 GB/s
  Total global memory: 11996954624
  Total shared memory per block: 49152
  Total registers per block: 65536
  Warp size: 32
  Maximum memory pitch: 2147483647
  Maximum threads per block: 1024
  Maximum dimension of block: 1024 x 1024 x 64
  Maximum dimension of grid: 2147483647 x 65535 x 65535
  Total constant memory: 65536
  Texture alignment: 512
  Concurrent copy and execution: Yes
  Number of multiprocessors: 15
  Kernel execution timeout: No
  Integrated GPU sharing Host Memory: No

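As shown, this is an NVIDIA Tesla K40c with compute capability 3.5, followed by its other parameters. It is fine if not every field makes sense yet; the main purpose here is to confirm that the environment is configured correctly. To see the full set of properties, add more printf statements for the corresponding fields; the complete list is in NVIDIA's official CUDA Runtime API documentation, under the cudaDeviceProp structure.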
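One practical use of these properties is checking a launch configuration before using it. The following is a minimal sketch and not part of the original program: the helper name blockFitsDevice is hypothetical, and it only checks the per-block limits printed above (maxThreadsPerBlock and maxThreadsDim).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if `block` respects the per-block
// limits reported in `prop` (total thread count and each dimension).
bool blockFitsDevice(const dim3 &block, const cudaDeviceProp &prop) {
    unsigned long long threads =
        (unsigned long long)block.x * block.y * block.z;
    return threads <= (unsigned long long)prop.maxThreadsPerBlock &&
           block.x <= (unsigned int)prop.maxThreadsDim[0] &&
           block.y <= (unsigned int)prop.maxThreadsDim[1] &&
           block.z <= (unsigned int)prop.maxThreadsDim[2];
}

int main(void) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    dim3 small(32, 32);  // 1024 threads per block
    dim3 large(64, 32);  // 2048 threads per block
    printf("32 x 32 block fits: %s\n", blockFitsDevice(small, prop) ? "Yes" : "No");
    printf("64 x 32 block fits: %s\n", blockFitsDevice(large, prop) ? "Yes" : "No");
    return 0;
}
```

The answer depends on the device; the limits differ between GPU generations, which is why the check reads them at run time instead of hard-coding them.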

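4 Basic concepts

The code is split into two parts: one runs on the CPU, also called the host, and the other runs on the GPU, also called the device. The figure below shows how the two relate.
Figure: relationship between host and device
When the program runs, the host first sends data to the GPU over the bus; once the GPU finishes computing, the results are copied back to the host. Transfer time depends on factors such as bus bandwidth and data size, so repeated transfers should be avoided as much as possible. Otherwise the GPU may well spend more time waiting for data than actually computing.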
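The round trip described above (host to device, compute, device to host) can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel doubleElements and the helper doubleOnDevice are hypothetical names, and the block size of 256 is an arbitrary choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: each thread doubles one array element.
__global__ void doubleElements(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;  // guard threads past the end of the array
}

// Hypothetical helper: copy `host` to the GPU, double every element there,
// and copy the result back into `host`. Returns the first CUDA error seen.
cudaError_t doubleOnDevice(float *host, int n) {
    float *dev = nullptr;
    size_t bytes = (size_t)n * sizeof(float);
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // host -> device
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // round up to cover all n
        doubleElements<<<blocks, threads>>>(dev, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(dev);
    return err;
}

int main(void) {
    const int n = 1000;
    float *h = (float *)malloc(n * sizeof(float));
    if (h == NULL) return 1;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaError_t err = doubleOnDevice(h, n);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        free(h);
        return 1;
    }
    printf("h[10] = %.1f\n", h[10]);  // 20.0 when the round trip succeeds
    free(h);
    return 0;
}
```

Here the data crosses the bus exactly twice, once in each direction. Doing more work per transfer, rather than copying back and forth between small kernels, is what keeps the transfer cost from dominating.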

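Defining code that runs on the GPU (kernel functions)

Code that runs on the GPU is declared like this:
__global__ void mykernel(void) { /* work to be computed */ }

__global__ marks a function that runs on the GPU
The function follows C/C++ syntax and can also use CUDA's extensions
A kernel function is called (launched) from the host
nvcc separates the host and device parts and compiles them separately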
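Putting the points above together, a Hello World kernel might look like the sketch below. The names helloKernel and launchHello are hypothetical, and the launch uses a single block with a single thread to keep the output to one line.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__: compiled for the device, launched from host code.
__global__ void helloKernel(void) {
    printf("Hello World from the GPU!\n");
}

// Hypothetical helper: launch the kernel with one block of one thread
// and report any error from the launch or from execution.
cudaError_t launchHello(void) {
    helloKernel<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // waits for the kernel, flushes printf
}

int main(void) {
    cudaError_t err = launchHello();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Assuming the file is saved as hello.cu (a hypothetical name), it can be built with nvcc hello.cu -o hello. The kernel launch is asynchronous, so without the cudaDeviceSynchronize() call the program could exit before the device-side printf output appears.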

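Grids and thread blocks

A grid is the upper level: it organizes its blocks in up to three dimensions. Threads in different blocks cannot directly communicate or synchronize through shared memory. A thread block is the lower level: it likewise organizes its threads in up to three dimensions, and threads in the same block can cooperate through shared memory and synchronize with __syncthreads().

Each kernel launch creates exactly one grid, which can contain many blocks. Threads are divided into grids and blocks to give a clear structure that makes them easy to manage and flexible to use.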
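To make the "threads in the same block can cooperate" point concrete, here is a small sketch in which each block sums its own slice of an array through shared memory. The names blockSum, sumPerBlock, and BLOCK_SIZE are hypothetical, and the tree reduction shown assumes a power-of-two block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 8  // threads per block; must be a power of two here

// Each block adds up its own BLOCK_SIZE inputs using shared memory.
__global__ void blockSum(const int *in, int *out) {
    __shared__ int partial[BLOCK_SIZE];  // visible to this block only
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();  // all loads finish before anyone reads a neighbor

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0];
}

// Hypothetical helper: runs blockSum on numBlocks * BLOCK_SIZE inputs.
cudaError_t sumPerBlock(const int *hostIn, int *hostOut, int numBlocks) {
    int *devIn = nullptr, *devOut = nullptr;
    size_t inBytes = (size_t)numBlocks * BLOCK_SIZE * sizeof(int);
    size_t outBytes = (size_t)numBlocks * sizeof(int);
    cudaError_t err = cudaMalloc(&devIn, inBytes);
    if (err == cudaSuccess) err = cudaMalloc(&devOut, outBytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(devIn, hostIn, inBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        blockSum<<<numBlocks, BLOCK_SIZE>>>(devIn, devOut);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, devOut, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(devIn);   // cudaFree(nullptr) is a no-op
    cudaFree(devOut);
    return err;
}

int main(void) {
    int in[16], out[2] = {0, 0};
    for (int i = 0; i < 16; ++i) in[i] = i + 1;  // 1..16

    cudaError_t err = sumPerBlock(in, out, 2);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("block sums: %d %d\n", out[0], out[1]);  // 36 (1..8) and 100 (9..16)
    return 0;
}
```

The two blocks never see each other's partial array; each writes only its own result to global memory, which is the usual way results leave a block.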

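Figure: thread management model
The figure above shows a model with a two-dimensional grid and two-dimensional blocks. It is taken from 谭生's blog, which is well written and comprehensive; it is recommended for anyone who wants to learn the subject systematically.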

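Launching a kernel

A kernel is launched as shown below. The program corresponds to the model in the figure above: the grid contains 6 blocks, and each block contains 15 threads.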

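#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void mykernel(void) {
    // Global x (column) and y (row) coordinates of this thread within the grid
    int col_index = threadIdx.x + blockIdx.x * blockDim.x;
    int row_index = threadIdx.y + blockIdx.y * blockDim.y;
    printf("hello from (%d,%d) \n",row_index,col_index);
}

int main(void) {
    dim3 grid(2,3);   // 2 x 3 = 6 blocks in the grid
    dim3 block(3,5);  // 3 x 5 = 15 threads in each block
    mykernel<<<grid, block>>>();
    // Wait for the kernel to finish so the device-side printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}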

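Compile with nvcc grid_and_block.cu -o a.out and run ./a.out. The terminal prints 90 coordinates in total (6 blocks x 15 threads), as shown below. Blocks are scheduled in no guaranteed order, so the order of the lines can differ from run to run.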

hello from (10,3) 
hello from (10,4) 
hello from (10,5) 
hello from (11,3) 
hello from (11,4) 
hello from (11,5) 
hello from (12,3) 
hello from (12,4) 
hello from (12,5) 
hello from (13,3) 
hello from (13,4) 
hello from (13,5) 
hello from (14,3) 
hello from (14,4) 
hello from (14,5) 
hello from (0,0) 
hello from (0,1) 
hello from (0,2) 
hello from (1,0) 
hello from (1,1) 
hello from (1,2) 
hello from (2,0) 
hello from (2,1) 
hello from (2,2) 
hello from (3,0) 
hello from (3,1) 
hello from (3,2) 
hello from (4,0) 
hello from (4,1) 
hello from (4,2) 
hello from (10,0) 
hello from (10,1) 
hello from (10,2) 
hello from (11,0) 
hello from (11,1) 
hello from (11,2) 
hello from (12,0) 
hello from (12,1) 
hello from (12,2) 
hello from (13,0) 
hello from (13,1) 
hello from (13,2) 
hello from (14,0) 
hello from (14,1) 
hello from (14,2) 
hello from (5,0) 
hello from (5,1) 
hello from (5,2) 
hello from (6,0) 
hello from (6,1) 
hello from (6,2) 
hello from (7,0) 
hello from (7,1) 
hello from (7,2) 
hello from (8,0) 
hello from (8,1) 
hello from (8,2) 
hello from (9,0) 
hello from (9,1) 
hello from (9,2) 
hello from (5,3) 
hello from (5,4) 
hello from (5,5) 
hello from (6,3) 
hello from (6,4) 
hello from (6,5) 
hello from (7,3) 
hello from (7,4) 
hello from (7,5) 
hello from (8,3) 
hello from (8,4) 
hello from (8,5) 
hello from (9,3) 
hello from (9,4) 
hello from (9,5) 
hello from (0,3) 
hello from (0,4) 
hello from (0,5) 
hello from (1,3) 
hello from (1,4) 
hello from (1,5) 
hello from (2,3) 
hello from (2,4) 
hello from (2,5) 
hello from (3,3) 
hello from (3,4) 
hello from (3,5) 
hello from (4,3) 
hello from (4,4) 
hello from (4,5)
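The (row, column) pairs printed above are often flattened into a single array index. The sketch below is one way to do that, under stated assumptions: flatIndex and writeIds are hypothetical names, and row-major layout is an arbitrary choice. It reuses the same 2 x 3 grid and 3 x 5 block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: row-major linear index of (row, col) in a 2D
// arrangement that is `width` columns wide. Usable on host and device.
__host__ __device__ inline int flatIndex(int row, int col, int width) {
    return row * width + col;
}

// Each thread writes its own linear index into the output array.
__global__ void writeIds(int *out, int width, int height) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (col < width && row < height) {
        int id = flatIndex(row, col, width);
        out[id] = id;
    }
}

int main(void) {
    dim3 grid(2, 3);
    dim3 block(3, 5);
    const int width = (int)(grid.x * block.x);   // 6 columns
    const int height = (int)(grid.y * block.y);  // 15 rows
    const int n = width * height;                // 90 threads in total

    int *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    writeIds<<<grid, block>>>(dev, width, height);

    std::vector<int> host(n, -1);
    // cudaMemcpy waits for the kernel on the default stream before copying
    err = cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != i) ++mismatches;  // every slot should hold its own index
    printf("%d threads, %d mismatches\n", n, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Because every thread gets a distinct (row, column) pair, every thread also gets a distinct linear index, so the 90 threads fill the 90 array slots without overlapping regardless of the order in which blocks run.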

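5 Summary

Today's post gave a rough overview of what CUDA is and the most basic concepts to establish, then showed an example of using a kernel function. More to come tomorrow!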
