大多数编译器都会优化 MATMUL(TRANSPOSE(A),B) 吗？

2024-04-02

In a Fortran program, I need to compute several expressions like M · v, M^T · v, M^T · M, M · M^T, etc ... Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10)

我想知道是否写MATMUL(TRANSPOSE(M),v)将在编译时展开为一些代码，其效率如下MATMUL(N,v), where N显式存储为N=TRANSPOSE(M)。我对具有“强”优化标志（例如 -O2、-O3 或 -Ofast）的 gnu 和 ifort 编译器特别感兴趣。

下面您可以看到各种方法的几个执行时间。

system:

英特尔(R) 酷睿(TM) i5-6500T CPU @ 2.50GHz

缓存大小：6144 KB

内存：16MB

GNU Fortran (GCC) 6.3.1 20170216（红帽 6.3.1-3）

艾福特(IFORT) 18.0.5 20180823

BLAS ：对于gnu编译器，使用的blas是默认版本

汇编：
[gnu] $ gfortran -O3 x.f90 -lblas
[intel] $ ifort -O3 -mkl x.f90
执行：
[gnu] $ ./a.out > matmul.gnu.txt
[intel] $ EXPORT MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt

为了使结果尽可能中立，我用完成的一组等效操作的平均时间重新调整了答案。我忽略了线程。

矩阵乘向量

比较了六种不同的实现：

manual: do j=1,n; do k=1,n; w(j) = P(j,k)*v(k); end do; end do
matmul: matmul(P,v)
blas N:dgemv('N',n,n,1.0D0,P,n,v,1,0,w,1)
matmul 转置： matmul(transpose(P),v)
matmul-转置-tmp： Q=transpose(P); w=matmul(Q,v)
blas T: dgemv('T',n,n,1.0D0,P,n,v,1,0,w,1)

在图 1 和图 2 中，您可以比较上述情况的时序结果。总的来说，我们可以说临时的使用在两个方面gfortran and ifort不建议。两种编译器都可以优化MATMUL(TRANSPOSE(P),v)好多了。而在gfortran，实施MATMUL比默认的 BLAS 更快，ifort清楚地表明mkl-blas是比较快的。

figure 1: Matrix-vector multiplication. Comparison of various implementations ran on gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n2 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

figure 2: Matrix-vector multiplication. Comparison of various implementations ran on a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n2 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

矩阵乘以矩阵

比较了六种不同的实现：

manual: do l=1,n; do j=1,n; do k=1,n; Q(j,l) = P(j,k)*P(k,l); end do; end do; end do
matmul: matmul(P,P)
blas N:dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
matmul 转置： matmul(transpose(P),P)
matmul-转置-tmp： Q=transpose(P); matmul(Q,P)
blas T: dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)

在图 3 和图 4 中，您可以比较上述情况的时序结果。与向量情况相反，仅建议 gfortran 使用临时变量。而在gfortran，实施MATMUL比默认的 BLAS 更快，ifort清楚地表明mkl-blas是比较快的。值得注意的是，手动实现相当于mkl-blas.

figure 3: Matrix-matrix multiplication. Comparison of various implementations ran on gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n3 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.

figure 4: Matrix-matrix multiplication. Comparison of various implementations ran on a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n3 × δ. Here δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.

使用的代码：

program matmul_test

  implicit none

  double precision, dimension(:,:), allocatable :: P,Q,R
  double precision, dimension(:), allocatable :: v,w

  integer :: n,i,j,k,l
  double precision,dimension(12) :: t1,t2

  do n = 1,1000
     allocate(P(n,n),Q(n,n), R(n,n), v(n),w(n))
     call random_number(P)
     call random_number(v)

     i=0

     i=i+1
     call cpu_time(t1(i))
     do j=1,n; do k=1,n; w(j) = P(j,k)*v(k); end do; end do
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     w=matmul(P,v)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemv('N',n,n,1.0D0,P,n,v,1,0,w,1)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     w=matmul(transpose(P),v)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=transpose(P)
     w=matmul(Q,v)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemv('T',n,n,1.0D0,P,n,v,1,0,w,1)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     do l=1,n; do j=1,n; do k=1,n; Q(j,l) = P(j,k)*P(k,l); end do; end do; end do
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=matmul(P,P)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=matmul(transpose(P),P)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     Q=transpose(P)
     R=matmul(Q,P)
     call cpu_time(t2(i))

     i=i+1
     call cpu_time(t1(i))
     call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
     call cpu_time(t2(i))

     write(*,'(I6,12D25.17)') n, t2-t1
     deallocate(P,Q,R,v,w)
  end do

end program matmul_test

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)