I'm investigating the capabilities of the branch unit on port 0 of Haswell, starting with a very simple loop:
    BITS 64
    GLOBAL _start

    SECTION .text

    _start:
        mov ecx, 10000000       ; iteration count

    .loop:
        dec ecx                 ;|
        jz .end                 ;| 1 uOP (call it D)
        jmp .loop               ;| 1 uOP (call it J)

    .end:
        mov eax, 60             ; exit syscall number
        xor edi, edi            ; exit status 0
        syscall
Using perf, we see that the loop runs at 1 c/iter:
    Performance counter stats for './main' (50 runs):

        10,001,055      uops_executed_port_port_6      ( +-  0.00% )
         9,999,973      uops_executed_port_port_0      ( +-  0.00% )
        10,015,414      cycles:u                       ( +-  0.02% )
                23      resource_stalls_rs             ( +- 64.05% )
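As a sanity check on the 1 c/iter figure, the per-iteration rates can be derived from the counters above. This is only a calculator-style sketch: the variable names are mine, and the numbers are copied from the perf output and from the loop count in the source.

```python
# Values copied from the perf output above and from `mov ecx, 10000000`.
ITERATIONS = 10_000_000      # loop trip count
cycles = 10_015_414          # cycles:u
port0_uops = 9_999_973       # uops_executed_port_port_0
port6_uops = 10_001_055      # uops_executed_port_port_6

cycles_per_iter = cycles / ITERATIONS
port0_per_iter = port0_uops / ITERATIONS
port6_per_iter = port6_uops / ITERATIONS
uops_per_cycle = (port0_uops + port6_uops) / cycles

print(f"cycles per iteration:       {cycles_per_iter:.4f}")
print(f"port 0 uOPs per iteration:  {port0_per_iter:.4f}")
print(f"port 6 uOPs per iteration:  {port6_per_iter:.4f}")
print(f"executed uOPs per cycle:    {uops_per_cycle:.4f}")
```

Each port executes about one uOP per iteration and the loop takes about one cycle per iteration, i.e. roughly 2 uOPs/c in total.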
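My interpretation of these results is:
- Both D and J are dispatched in parallel.
- J has a reciprocal throughput of 1 cycle.
- Both D and J are dispatched optimally.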
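However, we can also see that the RS never gets full.
It can dispatch at most 2 uOPs/c here, but it can theoretically receive 4 uOPs/c from the FE. With a net gain of 2 uOPs/c, that would lead to a full RS in about 30 c (for an RS with a size of 60 fused-domain entries).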
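The 30 c estimate is just the RS size divided by the net fill rate. A minimal sketch of that arithmetic, assuming the FE could really sustain 4 uOPs/c and taking the 60-entry RS size quoted above (the constant names are mine):

```python
RS_ENTRIES = 60      # fused-domain RS size assumed in the text
ISSUE_RATE = 4       # uOPs/c the FE could theoretically deliver
DISPATCH_RATE = 2    # uOPs/c leaving the RS in this loop (one D, one J)

# Each cycle the RS would gain what is issued minus what is dispatched.
net_fill_rate = ISSUE_RATE - DISPATCH_RATE
cycles_to_fill = RS_ENTRIES / net_fill_rate

print(f"net fill rate: {net_fill_rate} uOPs/c")
print(f"cycles until the RS is full: {cycles_to_fill:.0f}")
```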
In my understanding, there should be very few branch mispredictions and the uOPs should all come from the LSD.
So I looked at the FE:
         8,239,091      lsd_cycles_active              ( +-  3.10% )
           989,320      idq_dsb_cycles                 ( +- 23.47% )
         2,534,972      idq_mite_cycles                ( +- 15.43% )
             4,929      idq_ms_uops                    ( +-  8.30% )

       0.007429733 seconds time elapsed                ( +-  1.79% )
which confirms that the FE is issuing from the LSD [1].
However, the LSD never issues 4 uOPs/c:
         7,591,866      lsd_cycles_active              ( +-  3.17% )
                 0      lsd_cycles_4_uops
My interpretation is that the LSD cannot issue uOPs from the next iteration [2], thereby only sending D J pairs to the BE each cycle.
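To make the hypothesis concrete, here is a toy model of RS occupancy under the two possible behaviours. It is only a sketch under my own assumptions (a 2-uOP loop body, 4-wide issue, 2 uOPs/c dispatch, a 60-entry RS, no other stalls), not a description of how the hardware actually works; `simulate` and its parameters are hypothetical names.

```python
def simulate(cross_iterations, cycles=100, body_uops=2, issue_width=4,
             dispatch_rate=2, rs_size=60):
    """Return the peak RS occupancy reached in a toy issue/dispatch model."""
    occupancy = 0
    peak = 0
    for _ in range(cycles):
        # uOPs issued in earlier cycles are dispatched and leave the RS.
        occupancy -= min(occupancy, dispatch_rate)
        # The LSD either fills the whole issue group across iterations,
        # or stops at the end of the loop body (the hypothesis).
        width = issue_width if cross_iterations else min(issue_width, body_uops)
        # Issue is limited by the free RS entries.
        occupancy += min(width, rs_size - occupancy)
        peak = max(peak, occupancy)
    return peak


print("peak RS occupancy, LSD crosses iterations:   ", simulate(True))
print("peak RS occupancy, LSD stops at the boundary:", simulate(False))
```

In this model the RS saturates when issue may cross the iteration boundary and stays at one D J pair when it may not, which is the pattern that `resource_stalls_rs` and `lsd_cycles_4_uops` seem to show.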
Is my interpretation correct?
The source code is in this repository.
[1] There is a bit of variance; I think this is due to the high number of iterations, which allows for some context switches.
[2] This sounds quite complex to do in hardware with limited circuit depth.