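I have two networks that I am profiling to see which operations take up most of the time. I noticed that the CUDA time avg for the aten::conv2d operation differs between the two networks, and by a wide margin: it is about 22us in the first network and about 3ms in the second. The convolution layers in my first network have up to 512 filters, while those in the second have at most 192. I therefore expected the average time per convolution to be lower in the second network. Instead, it is more than two orders of magnitude higher (roughly 145x). Why does this happen?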
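One thing that may be worth checking alongside the filter counts is the spatial size of the feature maps, since the work in a dense 2-D convolution scales with the output height and width as well as with the channel counts. The sketch below uses the usual multiply-accumulate estimate for a dense convolution. The helper name `conv2d_macs` and all layer shapes are hypothetical and are not taken from either network.

```python
def conv2d_macs(c_in, c_out, kernel, h_out, w_out, batch=1):
    """Estimate multiply-accumulates for a dense 2-D convolution (no groups)."""
    return batch * h_out * w_out * c_out * c_in * kernel * kernel


# Hypothetical layer A: many filters, small feature map.
many_filters_small_map = conv2d_macs(c_in=512, c_out=512, kernel=3, h_out=16, w_out=16)

# Hypothetical layer B: fewer filters, much larger feature map.
fewer_filters_large_map = conv2d_macs(c_in=192, c_out=192, kernel=3, h_out=128, w_out=128)

print(f"512 filters on 16x16:   {many_filters_small_map:,} MACs")
print(f"192 filters on 128x128: {fewer_filters_large_map:,} MACs")
print(f"ratio: {fewer_filters_large_map / many_filters_small_map:.1f}x")
```

Under these assumed shapes the layer with fewer filters does more work, so comparing the input shapes of the convolutions in the two networks could help explain part of the gap.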
The full profiling output is below.
Network 1:
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
cudaLaunchKernel 99.80% 933.739ms 99.80% 933.739ms 20.750ms 0.000us 0.00% 0.000us 0.000us 45
model_inference 0.05% 453.000us 100.00% 935.567ms 935.567ms 0.000us 0.00% 195.000us 195.000us 1
aten::cudnn_convolution 0.04% 388.000us 99.84% 934.047ms 103.783ms 195.000us 100.00% 195.000us 21.667us 9
aten::_convolution 0.01% 138.000us 99.88% 934.419ms 103.824ms 0.000us 0.00% 195.000us 21.667us 9
aten::conv2d 0.01% 122.000us 99.89% 934.592ms 103.844ms 0.000us 0.00% 195.000us 21.667us 9
aten::add_ 0.01% 112.000us 0.02% 155.000us 17.222us 0.000us 0.00% 0.000us 0.000us 9
aten::upsample_nearest2d 0.01% 82.000us 0.01% 105.000us 26.250us 0.000us 0.00% 0.000us 0.000us 4
aten::empty 0.01% 79.000us 0.01% 79.000us 3.292us 0.000us 0.00% 0.000us 0.000us 24
aten::threshold 0.01% 74.000us 0.02% 149.000us 18.625us 0.000us 0.00% 0.000us 0.000us 8
aten::_cat 0.01% 71.000us 0.01% 119.000us 29.750us 0.000us 0.00% 0.000us 0.000us 4
aten::relu 0.01% 57.000us 0.02% 206.000us 25.750us 0.000us 0.00% 0.000us 0.000us 8
aten::convolution 0.01% 51.000us 99.88% 934.470ms 103.830ms 0.000us 0.00% 195.000us 21.667us 9
aten::view 0.01% 50.000us 0.01% 50.000us 5.556us 0.000us 0.00% 0.000us 0.000us 9
aten::cat 0.00% 32.000us 0.02% 151.000us 37.750us 0.000us 0.00% 0.000us 0.000us 4
aten::reshape 0.00% 29.000us 0.01% 79.000us 8.778us 0.000us 0.00% 0.000us 0.000us 9
aten::resize_ 0.00% 25.000us 0.00% 25.000us 0.962us 0.000us 0.00% 0.000us 0.000us 26
aten::rsub 0.00% 21.000us 0.00% 33.000us 33.000us 0.000us 0.00% 0.000us 0.000us 1
aten::mul 0.00% 17.000us 0.00% 27.000us 27.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zeros 0.00% 13.000us 0.00% 16.000us 16.000us 0.000us 0.00% 0.000us 0.000us 1
cudaEventRecord 0.00% 12.000us 0.00% 12.000us 1.333us 0.000us 0.00% 0.000us 0.000us 9
cudaBindTexture 0.00% 11.000us 0.00% 11.000us 2.750us 0.000us 0.00% 0.000us 0.000us 4
aten::empty_strided 0.00% 6.000us 0.00% 6.000us 6.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zero_ 0.00% 1.000us 0.00% 1.000us 1.000us 0.000us 0.00% 0.000us 0.000us 1
cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::ma... 0.00% 0.000us 0.00% 0.000us 0.000us 195.000us 100.00% 195.000us 195.000us 1
cudaUnbindTexture 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 4
Self CPU time total: 935.583ms
Self CUDA time total: 195.000us
Network 2:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaMemcpyAsync 42.86% 1.035s 42.86% 1.035s 11.495ms 0.000us 0.00% 0.000us 0.000us 90
cudaLaunchKernel 34.81% 840.325ms 34.81% 840.325ms 169.969us 0.000us 0.00% 0.000us 0.000us 4944
cudaStreamSynchronize 15.92% 384.331ms 15.92% 384.331ms 5.736ms 0.000us 0.00% 0.000us 0.000us 67
model_inference 1.51% 36.559ms 100.00% 2.414s 2.414s 0.000us 0.00% 1.215s 1.215s 1
aten::fill_ 1.03% 24.843ms 34.91% 842.670ms 7.731ms 8.759ms 0.72% 8.759ms 80.358us 109
aten::sum 0.57% 13.648ms 0.91% 22.019ms 18.123us 57.415ms 4.73% 57.415ms 47.255us 1215
aten::slice 0.50% 12.124ms 0.59% 14.229ms 3.526us 0.000us 0.00% 0.000us 0.000us 4035
aten::mul 0.49% 11.935ms 0.88% 21.340ms 17.293us 492.228ms 40.52% 492.228ms 398.888us 1234
aten::empty 0.44% 10.568ms 0.44% 10.568ms 2.556us 0.000us 0.00% 0.000us 0.000us 4134
aten::clamp 0.31% 7.455ms 0.84% 20.342ms 19.485us 12.405ms 1.02% 24.810ms 23.764us 1044
aten::add 0.25% 6.053ms 0.36% 8.615ms 14.334us 33.147ms 2.73% 33.147ms 55.153us 601
aten::cudnn_convolution 0.18% 4.459ms 0.27% 6.549ms 46.779us 423.769ms 34.88% 423.769ms 3.027ms 140
aten::div 0.16% 3.892ms 0.27% 6.584ms 16.098us 3.225ms 0.27% 3.225ms 7.885us 409
aten::resize_ 0.09% 2.287ms 0.10% 2.445ms 2.582us 75.000us 0.01% 75.000us 0.079us 947
aten::copy_ 0.09% 2.226ms 58.96% 1.423s 6.498ms 80.877ms 6.66% 81.024ms 369.973us 219
aten::_cat 0.09% 2.087ms 0.12% 2.971ms 34.547us 26.689ms 2.20% 26.689ms 310.337us 86
aten::as_strided 0.09% 2.082ms 0.10% 2.305ms 0.554us 0.000us 0.00% 0.000us 0.000us 4164
aten::constant_pad_nd 0.06% 1.497ms 34.09% 822.790ms 9.350ms 0.000us 0.00% 46.706ms 530.750us 88
aten::_convolution 0.05% 1.113ms 0.38% 9.142ms 65.300us 0.000us 0.00% 440.725ms 3.148ms 140
aten::sub 0.04% 1.082ms 0.08% 1.905ms 18.676us 16.975ms 1.40% 16.975ms 166.422us 102
aten::leaky_relu 0.03% 727.000us 0.05% 1.253ms 19.277us 11.039ms 0.91% 11.039ms 169.831us 65
aten::reciprocal 0.03% 722.000us 0.05% 1.258ms 17.971us 10.340ms 0.85% 10.340ms 147.714us 70
aten::index 0.03% 707.000us 0.09% 2.140ms 66.875us 16.861ms 1.39% 17.207ms 537.719us 32
aten::add_ 0.03% 672.000us 0.04% 1.027ms 14.671us 16.956ms 1.40% 16.956ms 242.229us 70
aten::conv2d 0.03% 610.000us 0.43% 10.298ms 73.557us 0.000us 0.00% 440.725ms 3.148ms 140
aten::view 0.03% 605.000us 0.03% 619.000us 2.623us 0.000us 0.00% 0.000us 0.000us 236
aten::empty_strided 0.02% 564.000us 0.02% 564.000us 6.409us 0.000us 0.00% 0.000us 0.000us 88
aten::convolution 0.02% 546.000us 0.40% 9.688ms 69.200us 0.000us 0.00% 440.725ms 3.148ms 140
aten::narrow 0.02% 534.000us 0.06% 1.388ms 4.131us 0.000us 0.00% 0.000us 0.000us 336
aten::cat 0.02% 511.000us 0.14% 3.482ms 40.488us 0.000us 0.00% 26.689ms 310.337us 86
aten::to 0.02% 413.000us 58.86% 1.421s 9.665ms 0.000us 0.00% 42.584ms 289.687us 147
aten::rsub 0.02% 374.000us 0.03% 616.000us 19.250us 92.000us 0.01% 92.000us 2.875us 32
aten::select 0.01% 311.000us 0.01% 354.000us 4.023us 0.000us 0.00% 0.000us 0.000us 88
aten::reshape 0.01% 304.000us 0.03% 660.000us 3.976us 0.000us 0.00% 0.000us 0.000us 166
aten::ceil 0.01% 265.000us 0.03% 717.000us 21.088us 606.000us 0.05% 1.212ms 35.647us 34
aten::permute 0.01% 214.000us 0.01% 249.000us 4.446us 0.000us 0.00% 0.000us 0.000us 56
aten::upsample_bilinear2d 0.01% 199.000us 0.03% 629.000us 34.944us 2.185ms 0.18% 2.260ms 125.556us 18
aten::expand 0.01% 189.000us 0.01% 246.000us 3.417us 0.000us 0.00% 0.000us 0.000us 72
aten::ones 0.01% 180.000us 1.02% 24.632ms 947.385us 0.000us 0.00% 0.000us 0.000us 26
aten::gt 0.01% 162.000us 0.02% 474.000us 29.625us 496.000us 0.04% 992.000us 62.000us 16
aten::repeat 0.01% 154.000us 0.03% 724.000us 60.333us 0.000us 0.00% 0.000us 0.000us 12
cudaEventRecord 0.01% 146.000us 0.01% 146.000us 1.043us 0.000us 0.00% 0.000us 0.000us 140
aten::unsqueeze 0.01% 144.000us 0.01% 177.000us 3.404us 0.000us 0.00% 0.000us 0.000us 52
aten::contiguous 0.01% 139.000us 0.03% 735.000us 22.969us 0.000us 0.00% 346.000us 10.812us 32
aten::mean 0.01% 137.000us 0.01% 214.000us 23.778us 131.000us 0.01% 131.000us 14.556us 9
aten::arange 0.01% 124.000us 0.01% 242.000us 10.083us 0.000us 0.00% 0.000us 0.000us 24
aten::empty_like 0.01% 123.000us 0.01% 284.000us 5.680us 0.000us 0.00% 0.000us 0.000us 50
cudaBindTexture 0.01% 121.000us 0.01% 121.000us 3.025us 0.000us 0.00% 0.000us 0.000us 40
aten::stack 0.00% 112.000us 0.03% 802.000us 50.125us 0.000us 0.00% 158.000us 9.875us 16
aten::floor 0.00% 77.000us 0.01% 191.000us 23.875us 18.000us 0.00% 36.000us 4.500us 8
aten::moveaxis 0.00% 73.000us 0.01% 276.000us 11.500us 0.000us 0.00% 0.000us 0.000us 24
aten::movedim 0.00% 67.000us 0.01% 203.000us 8.458us 0.000us 0.00% 0.000us 0.000us 24
aten::unfold 0.00% 61.000us 0.00% 82.000us 2.562us 0.000us 0.00% 0.000us 0.000us 32
aten::leaky_relu_ 0.00% 51.000us 0.00% 119.000us 23.800us 0.000us 0.00% 789.000us 157.800us 5
aten::_s_where 0.00% 51.000us 0.00% 91.000us 22.750us 536.000us 0.04% 536.000us 134.000us 4
aten::clone 0.00% 36.000us 0.01% 159.000us 31.800us 0.000us 0.00% 435.000us 87.000us 5
aten::where 0.00% 34.000us 0.01% 174.000us 43.500us 0.000us 0.00% 536.000us 134.000us 4
aten::expand_as 0.00% 27.000us 0.00% 70.000us 4.375us 0.000us 0.00% 0.000us 0.000us 16
aten::zeros 0.00% 18.000us 0.00% 29.000us 14.500us 0.000us 0.00% 0.000us 0.000us 2
aten::item 0.00% 16.000us 0.00% 22.000us 2.750us 0.000us 0.00% 0.000us 0.000us 8
aten::detach_ 0.00% 10.000us 0.00% 15.000us 3.750us 0.000us 0.00% 0.000us 0.000us 4
aten::alias 0.00% 8.000us 0.00% 8.000us 0.667us 0.000us 0.00% 0.000us 0.000us 12
aten::_local_scalar_dense 0.00% 6.000us 0.00% 6.000us 0.750us 0.000us 0.00% 0.000us 0.000us 8
detach_ 0.00% 5.000us 0.00% 5.000us 1.250us 0.000us 0.00% 0.000us 0.000us 4
aten::zero_ 0.00% 2.000us 0.00% 2.000us 1.000us 0.000us 0.00% 0.000us 0.000us 2
Memcpy HtoD (Pageable -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 41.981ms 3.46% 41.981ms 626.582us 67
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 8.759ms 0.72% 8.759ms 105.530us 83
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 37.512ms 3.09% 37.512ms 451.952us 83
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 65.145ms 5.36% 65.145ms 208.131us 313
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 416.783ms 34.31% 416.783ms 494.992us 842
void at::native::reduce_kernel<256, 2, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 2.070ms 0.17% 2.070ms 8.519us 243
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 12.051ms 0.99% 12.051ms 24.950us 483
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 3.225ms 0.27% 3.225ms 7.885us 409
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 12.284ms 1.01% 12.284ms 24.277us 506
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 26.580ms 2.19% 26.580ms 359.189us 74
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 11.039ms 0.91% 11.039ms 169.831us 65
Memcpy DtoD (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 510.000us 0.04% 510.000us 22.174us 23
cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::ma... 0.00% 0.000us 0.00% 0.000us 0.000us 62.000us 0.01% 62.000us 5.167us 12
maxwell_scudnn_128x32_relu_interior_nn 0.00% 0.000us 0.00% 0.000us 0.000us 1.320ms 0.11% 1.320ms 132.000us 10
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 10.340ms 0.85% 10.340ms 147.714us 70
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 10.300ms 0.85% 10.300ms 130.380us 79
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 50.898ms 4.19% 50.898ms 242.371us 210
void cudnn::winograd::generateWinogradTilesKernel<0,... 0.00% 0.000us 0.00% 0.000us 0.000us 1.166ms 0.10% 1.166ms 13.250us 88
maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_n... 0.00% 0.000us 0.00% 0.000us 0.000us 150.355ms 12.38% 150.355ms 1.709ms 88
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 3.775ms 0.31% 3.775ms 78.646us 48
maxwell_scudnn_128x128_relu_interior_nn 0.00% 0.000us 0.00% 0.000us 0.000us 106.000us 0.01% 106.000us 106.000us 1
maxwell_scudnn_128x128_relu_small_nn 0.00% 0.000us 0.00% 0.000us 0.000us 104.000us 0.01% 104.000us 104.000us 1
cudaUnbindTexture 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 40
void cudnn::detail::implicit_convolve_sgemm<float, f... 0.00% 0.000us 0.00% 0.000us 0.000us 12.632ms 1.04% 12.632ms 789.500us 16
void at::native::reduce_kernel<256, 2, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 10.000us 0.00% 10.000us 10.000us 1
void at::native::(anonymous namespace)::upsample_bil... 0.00% 0.000us 0.00% 0.000us 0.000us 2.185ms 0.18% 2.185ms 121.389us 18
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 606.000us 0.05% 606.000us 35.647us 17
void at::native::reduce_kernel<128, 4, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 121.000us 0.01% 121.000us 15.125us 8
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 18.000us 0.00% 18.000us 4.500us 4
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 103.000us 0.01% 103.000us 12.875us 8
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 121.000us 0.01% 121.000us 7.562us 16
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 109.000us 0.01% 109.000us 13.625us 8
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 354.000us 0.03% 354.000us 11.062us 32
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 92.000us 0.01% 92.000us 2.875us 32
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 346.000us 0.03% 346.000us 10.812us 32
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.414s
Self CUDA time total: 1.215s
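The CUDA time avg values quoted for aten::conv2d can be cross-checked against the two tables: in both rows the average matches CUDA total divided by # of Calls. A minimal sketch of that arithmetic, using only the numbers printed above (the helper names are illustrative):

```python
# Conversion factors from the profiler's units to microseconds.
UNIT_TO_US = {"us": 1.0, "ms": 1e3, "s": 1e6}


def avg_per_call_us(total, unit, calls):
    """Average time per call in microseconds, given a total and a call count."""
    return total * UNIT_TO_US[unit] / calls


# aten::conv2d rows: CUDA total and # of Calls as printed in the tables.
net1_avg = avg_per_call_us(195.000, "us", 9)     # Network 1
net2_avg = avg_per_call_us(440.725, "ms", 140)   # Network 2
ratio = net2_avg / net1_avg

print(f"network 1: {net1_avg:.3f} us per call")
print(f"network 2: {net2_avg:.3f} us per call")
print(f"network 2 / network 1: {ratio:.1f}x")
```

Note that the Network 1 total of 195.000us comes from a single recorded kernel (computeOffsetsKernel), so its average may not reflect the full convolution work.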
Profiling code:
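import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Profile one inference pass without tracking gradients.
with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
        with record_function("model_inference"):
            output_batch = self.frame_predictor(input_batch)
# Print the operators sorted by their own CUDA time.
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))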
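A possible variation on the profiling code, offered as a sketch rather than a confirmed fix: run a few warm-up passes before profiling, so one-time setup cost is less likely to land in the measured pass, and group the results by input shape so convolutions of different sizes are reported separately. The function name `profile_inference` and the `warmup_iters` parameter are hypothetical. The code falls back to CPU-only profiling when CUDA is not available.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity


def profile_inference(model, input_batch, warmup_iters=3, row_limit=10):
    """Profile one forward pass after warm-up; return a per-shape summary table."""
    use_cuda = torch.cuda.is_available() and input_batch.is_cuda
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    model.eval()
    with torch.no_grad():
        # Warm-up passes are run outside the profiler and are not measured.
        for _ in range(warmup_iters):
            model(input_batch)
        if use_cuda:
            torch.cuda.synchronize()

        with profile(activities=activities, record_shapes=True) as prof:
            with record_function("model_inference"):
                model(input_batch)
            if use_cuda:
                # Wait for queued kernels so they finish inside the profiled region.
                torch.cuda.synchronize()

    sort_key = "self_cuda_time_total" if use_cuda else "self_cpu_time_total"
    # Group by input shape so each distinct conv size gets its own row.
    averages = prof.key_averages(group_by_input_shape=True)
    return averages.table(sort_by=sort_key, row_limit=row_limit)
```

For the code in the question this would be called as `print(profile_inference(self.frame_predictor, input_batch))`, assuming `self.frame_predictor` is an `nn.Module` that takes a single tensor.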