Why doesn't this program seem to fuse correctly?

2023-12-09

I suspected that the program below was not fusing as expected, and I ran this test to confirm it:

module Main where

import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do

  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  let incAll = V.map (+ 1)

  print 
    . V.sum 

    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 

    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 
    . incAll 

    $ array

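The more incAll calls you add, the slower the program gets, which I believe means that stream fusion is not kicking in. I'm using GHC 8.0.1, building with stack, and I have included -O2 in the .cabal file's ghc-options. Am I missing something?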


Note: I'm using GHC 7.10.3 and stack 1.1.2 on Windows (x64), so your times might differ.

TL;DR

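If you want stream fusion to kick in, make sure your functions get inlined.

How streams get fused

Stream fusion relies heavily on the optimizer and on rewrite rules, at least in the vector package. So let's check which versions of your program optimize well.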
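
To make the role of rewrite rules and inlining more concrete, here is a minimal sketch of the idea on plain lists, using only base. It is not the vector package's actual implementation: the names Step, Stream, stream, unstream, mapS, sumS, mapL and sumL are all made up for this illustration. The point is that the rule can only match if mapL and sumL are inlined first, so that stream and unstream end up next to each other. Whether the rule actually fires depends on the GHC version and on compiling with optimization.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
module Main where

-- A toy stream type (hypothetical, for illustration only).
data Step s a = Done | Yield a s

data Stream a = forall s. Stream (s -> Step s a) s

-- Turn a list into a stream. Inlining is delayed to phase 1 so the
-- rewrite rule below gets a chance to match first.
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []       = Done
    next (x : xs) = Yield x xs
{-# INLINE [1] stream #-}

-- Turn a stream back into a list.
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Yield x s' -> x : go s'
{-# INLINE [1] unstream #-}

-- The fusion rule: converting to a list and straight back is a no-op.
{-# RULES "stream/unstream" forall s. stream (unstream s) = s #-}

-- Non-recursive map on streams: it only wraps the stepper function.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Yield x s' -> Yield (f x) s'
{-# INLINE mapS #-}

-- The only loop: a strict left fold that consumes the stream.
sumS :: Stream Int -> Int
sumS (Stream next s0) = go 0 s0
  where
    go acc s = case next s of
      Done       -> acc
      Yield x s' -> let acc' = acc + x in acc' `seq` go acc' s'
{-# INLINE sumS #-}

-- List-level wrappers. They must be inlined at the call site,
-- otherwise the rule never sees stream (unstream ...).
mapL :: (a -> b) -> [a] -> [b]
mapL f = unstream . mapS f . stream
{-# INLINE mapL #-}

sumL :: [Int] -> Int
sumL = sumS . stream
{-# INLINE sumL #-}

main :: IO ()
main = print (sumL (mapL (+ 1) (mapL (+ 1) (replicate 10 0))))  -- prints 20
```

If a wrapper such as mapL were hidden behind a binding that GHC decides not to inline, each call would build its intermediate list and the rule would have nothing to match, which mirrors what happens to incAll later in this answer.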

Minimal version (1 incAll)

Let's start simple. We first reduce the program to a minimum:

-- SOBase.hs
module Main where

import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do

  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  let incAll = V.map (+ 1)

  print 
    . V.sum     
    . incAll    
    $ array

Let's compile it and dump the Core that GHC generates:

$ stack ghc --package vector -- -O2 SOBase.hs -ddump-simpl -dsuppress-all

main2
main2 =
  case (runSTRep main3) `cast` ...
  of _ { Vector ipv_s6b2 ipv1_s6b3 ipv2_s6b4 ->
  letrec {
    $s$wfoldlM'_loop_s9wM
    $s$wfoldlM'_loop_s9wM =
      \ sc_s9wK sc1_s9wL ->
        case tagToEnum# (>=# sc1_s9wL ipv1_s6b3) of _ {
          False ->
            case indexIntArray# ipv2_s6b4 (+# ipv_s6b2 sc1_s9wL)
            of wild_a5ju { __DEFAULT ->
            $s$wfoldlM'_loop_s9wM (+# sc_s9wK (+# wild_a5ju 1)) (+# sc1_s9wL 1)
            };
          True -> sc_s9wK
        }; } in
  case $s$wfoldlM'_loop_s9wM 0 0 of ww_s94k { __DEFAULT ->
  case $wshowSignedInt 0 ww_s94k ([])
  of _ { (# ww5_a5fH, ww6_a5fI #) ->
  : ww5_a5fH ww6_a5fI
  }
  }
  }

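Let's make that a little prettier (a hand-simplified paraphrase, where vec and size stand for the vector and its length):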

main2 = let foldLoop s n 
              | n < size  = foldLoop (s + (vec ! n + 1)) (n + 1)
              | otherwise = s
        in print (foldLoop 0 0)
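
The paraphrase above is not compilable as written, since vec and size are not defined. A runnable sketch of the same loop, assuming the vector package is available, could look like this. The name sumPlusOne is made up for this illustration, and V.unsafeIndex is used only to mirror the unchecked indexIntArray# in the Core.

```haskell
module Main where

import qualified Data.Vector.Unboxed as V

-- Hand-written equivalent of the fused loop: one pass over the vector,
-- adding 1 to each element while summing, with no intermediate vector.
sumPlusOne :: V.Vector Int -> Int
sumPlusOne vec = go 0 0
  where
    size = V.length vec
    go acc n
      | n < size  =
          -- Force the accumulator so no chain of thunks builds up.
          let acc' = acc + (V.unsafeIndex vec n + 1)
          in acc' `seq` go acc' (n + 1)
      | otherwise = acc

main :: IO ()
main = print (sumPlusOne (V.replicate 1000 0))  -- prints 1000
```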

incAll has been inlined into the loop:

case indexIntArray# ipv2_s6b4 (+# ipv_s6b2 sc1_s9wL)
                of wild_a5ju { __DEFAULT ->
                $s$wfoldlM'_loop_s9wM (+# sc_s9wK (+# wild_a5ju 1)) (+# sc1_s9wL 1)
                                                  ^^^^^^^^^^^^^^^^

More calls (3 incAlls)

Let's use incAll more often:

-- SO3.hs
module Main where

import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do

  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  let incAll = V.map (+ 1)

  print
    . V.sum

    . incAll
    . incAll
    . incAll

    $ array

What does our Core contain now?

$wincAll
$wincAll =
  \ ww_s999 ww1_s99a ww2_s99b ->
    runSTRep
      (\ @ s_a4Rs s1_a4Rt ->
         case tagToEnum# (<# ww1_s99a 0) of _ {
           False ->
             case divInt# 9223372036854775807 8 of ww4_a5fa { __DEFAULT ->
             case tagToEnum# (># ww1_s99a ww4_a5fa) of _ {
               False ->
                 case newByteArray# (*# ww1_s99a 8) (s1_a4Rt `cast` ...)
                 of _ { (# ipv_a5dy, ipv1_a5dz #) ->
                 letrec {
                   $s$wa_s9DR
                   $s$wa_s9DR =
                     \ sc_s9DN sc1_s9DO sc2_s9DQ ->
                       case tagToEnum# (>=# sc1_s9DO ww1_s99a) of _ {
                         False ->
                           case indexIntArray# ww2_s99b (+# ww_s999 sc1_s9DO)
                           of wild_a5jF { __DEFAULT ->
                           case writeIntArray#
                                  ipv1_a5dz sc_s9DN (+# wild_a5jF 1) (sc2_s9DQ `cast` ...)
                           of s'#_a6Cg { __DEFAULT ->
                           $s$wa_s9DR (+# sc_s9DN 1) (+# sc1_s9DO 1) (s'#_a6Cg `cast` ...)
                           }
                           };
                         True -> (# sc2_s9DQ, I# sc_s9DN #)
                       }; } in
                 case $s$wa_s9DR 0 0 (ipv_a5dy `cast` ...)
                 of _ { (# ipv6_a4Nw, ipv7_a4Nx #) ->
                 case ipv7_a4Nx of _ { I# dt4_a5gC ->
                 case unsafeFreezeByteArray# ipv1_a5dz (ipv6_a4Nw `cast` ...)
                 of _ { (# ipv2_a52B, ipv3_a52C #) ->
                 (# ipv2_a52B `cast` ...,
                    (Vector 0 dt4_a5gC ipv3_a52C) `cast` ... #)
                 }
                 }
                 }
                 };
               True -> case main4 ww1_s99a of wild_00 { }
             }
             };
           True -> case main3 ww1_s99a of wild_00 { }
         })

....

main2
main2 =
  case (runSTRep main5) `cast` ...
  of _ { Vector ww1_s991 ww2_s992 ww3_s993 ->
  case ($wincAll ww1_s991 ww2_s992 ww3_s993) `cast` ...
--      ^^^^^^^^ oh
  of _ { Vector ww5_X99T ww6_X99V ww7_X99X ->
  case ($wincAll ww5_X99T ww6_X99V ww7_X99X) `cast` ...
--      ^^^^^^^^ oh
  of _ { Vector ww9_X99Y ww10_X9a0 ww11_X9a2 ->
  case ($wincAll ww9_X99Y ww10_X9a0 ww11_X9a2) `cast` ...
--      ^^^^^^^^ oh
  of _ { Vector ipv_s6cG ipv1_s6cH ipv2_s6cI ->
  letrec {
    $s$wfoldlM'_loop_s9Du
    $s$wfoldlM'_loop_s9Du =
      \ sc_s9Ds sc1_s9Dt ->
        case tagToEnum# (>=# sc1_s9Dt ipv1_s6cH) of _ {
          False ->
            case indexIntArray# ipv2_s6cI (+# ipv_s6cG sc1_s9Dt)
            of wild_a5jx { __DEFAULT ->
            $s$wfoldlM'_loop_s9Du (+# sc_s9Ds wild_a5jx) (+# sc1_s9Dt 1)
            };
          True -> sc_s9Ds
        }; } in
  case $s$wfoldlM'_loop_s9Du 0 0 of ww12_s99s { __DEFAULT ->
  case $wshowSignedInt 0 ww12_s99s ([])
  of _ { (# ww14_a5fK, ww15_a5fL #) ->
  : ww14_a5fK ww15_a5fL
  }
  }
  }
  }
  }
  }

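The function isn't inlined anymore! Each call to $wincAll allocates a fresh array (newByteArray#) and fills it in its own loop (writeIntArray#). Since incAll isn't inlined, stream fusion cannot kick in.

Inlining the function (3 incAlls)

Let's add an INLINE pragma: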

-- SO3I.hs
module Main where

import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do

  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  let {-# INLINE incAll #-}
      incAll = V.map (+1)
  print 
    . V.sum 

    . incAll 
    . incAll 
    . incAll 

    $ array

$ stack ghc --package vector -- -O2 -ddump-simpl SO3I.hs

What does main look like now?

main2                                                                         
main2 =                                                                       
  case (runSTRep main3) `cast` ...                                            
  of _ { Vector ipv_s6bG ipv1_s6bH ipv2_s6bI ->                               
  letrec {                                                                    
    $s$wfoldlM'_loop_s9z7                                                     
    $s$wfoldlM'_loop_s9z7 =                                                   
      \ sc_s9z5 sc1_s9z6 ->                                                   
        case tagToEnum# (>=# sc1_s9z6 ipv1_s6bH) of _ {                       
          False ->                                                            
            case indexIntArray# ipv2_s6bI (+# ipv_s6bG sc1_s9z6)              
            of wild_a5jC { __DEFAULT ->                                       
            $s$wfoldlM'_loop_s9z7                                             
              (+# sc_s9z5 (+# (+# (+# wild_a5jC 1) 1) 1)) (+# sc1_s9z6 1)     
            };                                                                
          True -> sc_s9z5                                                     
        }; } in                                                               
  case $s$wfoldlM'_loop_s9z7 0 0 of ww_s96F { __DEFAULT ->                    
  case $wshowSignedInt 0 ww_s96F ([])                                         
  of _ { (# ww5_a5fP, ww6_a5fQ #) ->                                          
  : ww5_a5fP ww6_a5fQ                                                         
  }                                                                           
  }                                                                           
  }                                                                           

Great. incAll has been inlined, as you can see here:

(+# sc_s9z5 (+# (+# (+# wild_a5jC 1) 1) 1)) (+# sc1_s9z6 1)     
                                  ^  ^  ^

So the problem was that incAll did not get inlined, and therefore you did not end up with

V.sum . V.map (+1) . V.map (+1) . V.map (+1)
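
As a variation on the let-bound pragma used above, the same hint can be given on a top-level definition. This is only a sketch, assuming the vector package is available. It was not part of the measurements in this answer, so whether it fuses on a given GHC version should be checked with -ddump-simpl as before. The size is reduced to 1000 here to keep the example quick.

```haskell
module Main where

import qualified Data.Vector.Unboxed as V

-- Top-level helper with an explicit INLINE pragma, so that each use
-- site can be replaced by V.map (+ 1) before the rewrite rules run.
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)
{-# INLINE incAll #-}

main :: IO ()
main = do
  let array = V.replicate 1000 0 :: V.Vector Int
  -- Three maps followed by a sum: 1000 elements, each 0 + 3.
  print . V.sum . incAll . incAll . incAll $ array  -- prints 3000
```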

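Your original program (now inlined, 32 incAlls)

Last but not least, let's try your original program again, this time with inlining. Is everything fine now? Let's look at the Core: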

main2
main2 =
  case (runSTRep main3) `cast` ...
  of _ { Vector ipv_s6xF ipv1_s6xG ipv2_s6xH ->
  letrec {
    $s$wfoldlM'_loop_sajT
    $s$wfoldlM'_loop_sajT =
      \ sc_sajR sc1_sajS ->
        case tagToEnum# (>=# sc1_sajS ipv1_s6xG) of _ {
          False ->
            case indexIntArray# ipv2_s6xH (+# ipv_s6xF sc1_sajS)
            of wild_a5mq { __DEFAULT ->
            $s$wfoldlM'_loop_sajT
              (+#
                 sc_sajR
                 (+#
                    (+#
                       (+#
                          (+#
                             (+#
                                (+#
                                   (+#
                                      (+#
                                         (+#
                                            (+#
                                               (+#
                                                  (+#
                                                     (+#
                                                        (+#
                                                           (+#
                                                              (+#
                                                                 (+#
                                                                    (+#
                                                                       (+#
                                                                          (+#
                                                                             (+#
                                                                                (+#
                                                                                   (+#
                                                                                      (+#
                                                                                         (+#
                                                                                            (+#
                                                                                               (+#
                                                                                                  (+#
                                                                                                     (+#
                                                                                                        (+#
                                                                                                           (+#
                                                                                                              (+#
                                                                                                                 wild_a5mq
                                                                                                                 1)
                                                                                                              1)
                                                                                                           1)
                                                                                                        1)
                                                                                                     1)
                                                                                                  1)
                                                                                               1)
                                                                                            1)
                                                                                         1)
                                                                                      1)
                                                                                   1)
                                                                                1)
                                                                             1)
                                                                          1)
                                                                       1)
                                                                    1)
                                                                 1)
                                                              1)
                                                           1)
                                                        1)
                                                     1)
                                                  1)
                                               1)
                                            1)
                                         1)
                                      1)
                                   1)
                                1)
                             1)
                          1)
                       1)
                    1))
              (+# sc1_sajS 1)
            };
          True -> sc_sajR
        }; } in
  case $s$wfoldlM'_loop_sajT 0 0 of ww_s9Rr { __DEFAULT ->
  case $wshowSignedInt 0 ww_s9Rr ([])
  of _ { (# ww5_a5iD, ww6_a5iE #) ->
  : ww5_a5iD ww6_a5iE
  }
  }
  }

Well, yes. But in this Core, GHC has not combined (+1) . (+1) into (+2) and so on. Is it actually faster?

$ stack ghc --package vector -- -O2 SO.hs && SO.exe +RTS -s
  26,400,052,464 bytes allocated in the heap                                             
           9,736 bytes copied during GC                                                  
     800,026,736 bytes maximum residency (2 sample(s))                                   
          61,328 bytes maximum slop                                                      
            1527 MB total memory in use (0 MB lost due to fragmentation)                 

                                     Tot time (elapsed)  Avg pause  Max pause            
  Gen  0        32 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s            
  Gen  1         2 colls,     0 par    0.000s   0.089s     0.0446s    0.0890s            

  INIT    time    0.000s  (  0.000s elapsed)                                             
  MUT     time    4.453s  (  4.616s elapsed)                                             
  GC      time    0.000s  (  0.090s elapsed)                                             
  EXIT    time    0.000s  (  0.089s elapsed)                                             
  Total   time    4.453s  (  4.795s elapsed)                                             

  %GC     time       0.0%  (1.9% elapsed)                                                

  Alloc rate    5,928,432,834 bytes per MUT second                                       

  Productivity 100.0% of total user, 92.9% of total elapsed                              

The original program takes about 4.5 seconds. What about the inlined one?

$ stack ghc --package vector -- -O2 SOFixed.hs && SOFixed.exe +RTS -s
3200000000
     800,048,112 bytes allocated in the heap
           4,352 bytes copied during GC
          42,664 bytes maximum residency (1 sample(s))
          18,776 bytes maximum slop
             764 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.045s     0.0452s    0.0452s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.188s  (  0.224s elapsed)
  GC      time    0.000s  (  0.045s elapsed)
  EXIT    time    0.000s  (  0.045s elapsed)
  Total   time    0.188s  (  0.315s elapsed)

  %GC     time       0.0%  (14.4% elapsed)

  Alloc rate    4,266,923,264 bytes per MUT second

  Productivity 100.0% of total user, 59.6% of total elapsed

About 0.2 seconds. Great! By the way, all the (+1) calls get optimized into a single addq $32,... further down the line.
