Why does perf stat not count cycles:u on a Broadwell CPU with hyperthreading disabled in BIOS?

2023-12-12

Given: a Broadwell CPU with hyperthreading disabled in BIOS

[root@ny4srv03 ~]# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  44
  On-line CPU(s) list:   0-43
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel
  Model name:            Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
    BIOS Model name:     Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  1
    Core(s) per socket:  22
    Socket(s):           2
    Stepping:            1
    CPU max MHz:         3700.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4399.69
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid
                         aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
                         cdp_l3 invpcid_single intel_ppin tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
                         dtherm ida arat pln pts
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.4 MiB (44 instances)
  L1i:                   1.4 MiB (44 instances)
  L2:                    11 MiB (44 instances)
  L3:                    110 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-21
  NUMA node1 CPU(s):     22-43
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX vulnerable, SMT disabled
  Mds:                   Vulnerable; SMT disabled
  Meltdown:              Vulnerable
  Mmio stale data:       Vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable

According to the Intel 64 and IA-32 Architectures Software Developer's Manual:

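If a processor core is shared by two logical processors, each logical processor can access up to four counters (IA32_PMC0-IA32_PMC3). This is the same as in the prior generation for processors based on Nehalem microarchitecture. If a processor core is not shared by two logical processors, up to eight general-purpose counters are visible. If CPUID.0AH:EAX[15:8] reports 8 counters, then IA32_PMC4-IA32_PMC7 would cover MSR addresses 0C5H through 0C8H. Each counter is accompanied by an event select MSR (IA32_PERFEVTSEL4-IA32_PERFEVTSEL7).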

So there should be 8 performance counters accessible, and cpuid shows exactly that:

[root@ny4srv03 ~]# cpuid -1 | grep counters
      number of counters per logical processor = 0x8 (8)
      number of contiguous fixed counters      = 0x3 (3)
      bit width of fixed counters              = 0x30 (48)
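
As a rough illustration of where those numbers come from, the sketch below decodes the counter fields of CPUID leaf 0AH (EAX[15:8] for general-purpose counters, EDX[4:0] and EDX[12:5] for fixed counters). The register values are made up to match the cpuid output above, with all other fields left at zero; they are not raw values read from this machine.

```python
from typing import NamedTuple


class PerfmonInfo(NamedTuple):
    gp_counters: int      # CPUID.0AH:EAX[15:8]
    fixed_counters: int   # CPUID.0AH:EDX[4:0]
    fixed_width: int      # CPUID.0AH:EDX[12:5]


def decode_cpuid_0a(eax: int, edx: int) -> PerfmonInfo:
    """Extract the counter-related fields from CPUID leaf 0AH registers."""
    return PerfmonInfo(
        gp_counters=(eax >> 8) & 0xFF,
        fixed_counters=edx & 0x1F,
        fixed_width=(edx >> 5) & 0xFF,
    )


# Hypothetical register values, built only from the fields shown by cpuid above.
EXAMPLE_EAX = 8 << 8            # 8 general-purpose counters per logical processor
EXAMPLE_EDX = 3 | (48 << 5)     # 3 fixed counters, 48 bits wide

info = decode_cpuid_0a(EXAMPLE_EAX, EXAMPLE_EDX)
print(info)
```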

But if I use perf in the following way (as root and with kernel.perf_event_paranoid set to -1), I get some strange results:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

                 0      cycles:u
            668753      instructions:u                                                ( +-  0.01% )
            131991      branches:u                                                    ( +-  0.00% )
              6936      branch-misses:u           #    5.25% of all branches          ( +-  0.33% )
             11105      cache-references:u                                            ( +-  0.13% )
                 6      cache-misses:u            #    0.055 % of all cache refs      ( +-  5.86% )
               103      faults:u                                                      ( +-  0.19% )

        0.00100211 +- 0.00000487 seconds time elapsed  ( +-  0.49% )

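It always shows cycles:u equal to 0, no matter how many times I run perf (note the -r 100 parameter), until I remove one of the branches:u, branch-misses:u, cache-references:u, or cache-misses:u events. In that case perf works as expected: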

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

            614142      cycles:u                                                      ( +-  0.06% )
            668790      instructions:u            #    1.09  insn per cycle           ( +-  0.00% )
            132052      branches:u                                                    ( +-  0.00% )
              6874      branch-misses:u           #    5.21% of all branches          ( +-  0.11% )
             10735      cache-references:u                                            ( +-  0.05% )
               101      faults:u                                                      ( +-  0.06% )

        0.00095650 +- 0.00000108 seconds time elapsed  ( +-  0.11% )

perf also works as expected in the following cases.

If the cycles event is requested either with no modifier at all or with the :k modifier:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

           1841276      cycles                                                        ( +-  0.79% )
            668400      instructions:u                                                ( +-  0.00% )
            131966      branches:u                                                    ( +-  0.00% )
              6121      branch-misses:u           #    4.64% of all branches          ( +-  0.40% )
             10987      cache-references:u                                            ( +-  0.16% )
                 0      cache-misses:u            #    0.000 % of all cache refs
               102      faults:u                                                      ( +-  0.18% )

        0.00102359 +- 0.00000649 seconds time elapsed  ( +-  0.63% )

If hyperthreading is enabled in BIOS and disabled via the nosmt kernel parameter:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

            618443      cycles:u                                                      ( +-  0.39% )
            668466      instructions:u            #    1.05  insn per cycle           ( +-  0.00% )
            131968      branches:u                                                    ( +-  0.00% )
              6529      branch-misses:u           #    4.95% of all branches          ( +-  0.34% )
             11096      cache-references:u                                            ( +-  0.47% )
                 1      cache-misses:u            #    0.010 % of all cache refs      ( +- 53.16% )
               107      faults:u                                                      ( +-  0.18% )

        0.00097825 +- 0.00000554 seconds time elapsed  ( +-  0.57% )

In this case cpuid also shows that only 4 performance counters are available:

[root@ny4srv03 ~]# cpuid -1 | grep counters
      number of counters per logical processor = 0x4 (4)
      number of contiguous fixed counters      = 0x3 (3)
      bit width of fixed counters              = 0x30 (48)

So I'm wondering whether there is a bug in perf or some kind of system misconfiguration. Could you please help?

Update 1

Running perf with -d shows that the NMI watchdog is enabled:

[root@ny4srv03 likwid]# perf stat \
   -e cycles:u \
   -e instructions:u \
   -e branches:u \
   -e branch-misses:u \
   -e cache-references:u \
   -e cache-misses:u \
   -e faults:u \
   -d \
   ls>/dev/null

 Performance counter stats for 'ls':

                 0      cycles:u
            709098      instructions:u
            140131      branches:u
              6826      branch-misses:u           #    4.87% of all branches
             11287      cache-references:u
                 0      cache-misses:u            #    0.000 % of all cache refs
               104      faults:u
            593753      L1-dcache-loads
             32677      L1-dcache-load-misses     #    5.50% of all L1-dcache accesses
              8679      LLC-loads
     <not counted>      LLC-load-misses                                               (0.00%)

       0.001102213 seconds time elapsed

       0.000000000 seconds user
       0.001134000 seconds sys


Some events weren't counted. Try disabling the NMI watchdog:
    echo 0 > /proc/sys/kernel/nmi_watchdog
    perf stat ...
    echo 1 > /proc/sys/kernel/nmi_watchdog

Disabling it gives the expected result:

echo 0 > /proc/sys/kernel/nmi_watchdog

[root@ny4srv03 likwid]# perf stat \
   -e cycles:u \
   -e instructions:u \
   -e branches:u \
   -e branch-misses:u \
   -e cache-references:u \
   -e cache-misses:u \
   -e faults:u \
   -d \
   ls>/dev/null

 Performance counter stats for 'ls':

            745760      cycles:u
            708833      instructions:u            #    0.95  insn per cycle
            140122      branches:u
              6757      branch-misses:u           #    4.82% of all branches
             11503      cache-references:u
                 0      cache-misses:u            #    0.000 % of all cache refs
               101      faults:u
            586223      L1-dcache-loads
             32856      L1-dcache-load-misses     #    5.60% of all L1-dcache accesses
              8794      LLC-loads
                29      LLC-load-misses           #    0.33% of all LL-cache accesses

       0.001000925 seconds time elapsed

       0.000000000 seconds user
       0.001080000 seconds sys
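
perf's hint suggests turning the watchdog back on after the measurement. A minimal sketch of that disable/measure/restore pattern is shown below; the helper name nmi_watchdog_disabled is hypothetical, and writing to the real /proc file requires root.

```python
import contextlib
from pathlib import Path

# Same file that perf's hint writes to.
NMI_WATCHDOG = Path("/proc/sys/kernel/nmi_watchdog")


@contextlib.contextmanager
def nmi_watchdog_disabled(path=NMI_WATCHDOG):
    """Write 0 to the watchdog control file, then restore the previous value."""
    path = Path(path)
    previous = path.read_text().strip()
    path.write_text("0\n")
    try:
        yield previous
    finally:
        # Restore whatever was configured before, even if the body raised.
        path.write_text(previous + "\n")


# Intended use (as root), shown as a comment because it changes system state:
#
#     import subprocess
#     with nmi_watchdog_disabled():
#         subprocess.run(["perf", "stat", "-e", "cycles:u", "ls"], check=True)
```

The path parameter exists only so the helper can be tried against an ordinary file first.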

But this still does not explain why cycles:u is 0 with nmi_watchdog enabled, even though dmesg shows:

[    0.300779] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

Update 2

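I found this nice comment in the source of the likwid tool suite, which states:

Please be aware that the counters PMC4-7 are broken on Intel Broadwell. They don't increment if either user- or kernel-level filtering is applied. User-level filtering is default in LIKWID, hence kernel-level filtering is added automatically for PMC4-7. The returned counts can be much higher.

So this could explain the behavior; it would now be interesting to find the original source of this information.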

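This is erratum BDE104. The NMI watchdog occupies a fixed counter, so cycles has to use a programmable counter.

From Intel's Xeon-D "specification update" (errata) document (I did not find one for regular Xeon v4):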

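BDE104: General-Purpose Performance Monitoring Counters 4-7 Will Not Increment Do Not Count With USR Mode Only Filtering

Problem: The IA32_PMC4-7 MSR (C5H-C8H) general-purpose performance monitoring counters will not count when the associated CPL filter selection in IA32_PERFEVTSELx MSR's (18AH-18DH) USR field (bit 16) is set while OS field (bit 17) is not set.

Implication: Software depending upon IA32_PMC4-7 to count only USR events will not operate as expected. Counting OS only events or OS and USR events together is unaffected by this erratum.

Workaround: None identified.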

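The NMI watchdog occupies fixed counter 1, the one that can normally count the cycles event. This makes perf pick a programmable counter for it, and it apparently picks one of the buggy ones.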

With the NMI watchdog disabled, perf uses fixed counter #1 for cycles. (It apparently supports user/kernel/both masking.)
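
To make the erratum condition concrete, here is a small sketch of it as a predicate over a counter index and an IA32_PERFEVTSELx value. The MSR base addresses and the USR/OS bit positions follow the erratum text quoted above; the function names are made up for illustration, and the example event encoding (event 0x3C, unit mask 0, for unhalted core cycles) is my assumption about what a cycles:u request looks like, not something taken from perf's source.

```python
USR_BIT = 1 << 16   # count while CPL > 0
OS_BIT = 1 << 17    # count while CPL == 0
EN_BIT = 1 << 22    # enable the counter

IA32_PMC0 = 0xC1          # IA32_PMC4-7 are then C5H-C8H
IA32_PERFEVTSEL0 = 0x186  # IA32_PERFEVTSEL4-7 are then 18AH-18DH


def pmc_msr(index: int) -> int:
    """MSR address of general-purpose counter `index`."""
    return IA32_PMC0 + index


def evtsel_msr(index: int) -> int:
    """MSR address of the event select register for counter `index`."""
    return IA32_PERFEVTSEL0 + index


def hits_bde104(index: int, evtsel: int) -> bool:
    """True if this counter/filter combination will not count per BDE104."""
    usr_only = bool(evtsel & USR_BIT) and not (evtsel & OS_BIT)
    return 4 <= index <= 7 and usr_only


# Assumed encoding of a user-only cycles event (event select 0x3C, umask 0).
cycles_user = 0x3C | USR_BIT | EN_BIT
cycles_both = cycles_user | OS_BIT

for idx in (3, 4):
    print(f"PMC{idx} (MSR {pmc_msr(idx):#x}): user-only broken = "
          f"{hits_bde104(idx, cycles_user)}")
```

Under this model, the same user-only event is fine on PMC0-3 and silently stuck at zero on PMC4-7, which matches cycles:u reading 0 while cycles (user plus kernel) counts normally.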

I tested on a Skylake system with HT enabled, so there are 4 programmable counters per logical core, plus the fixed counters.

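NMI watchdog disabled: cycles + instructions + 4 other events - no multiplexing.
NMI watchdog disabled: cycles + instructions + 5 other events - multiplexing. (Numbers like (86.32%) appear in a new column on the right, indicating the fraction of time this event was active; perf extrapolates from that fraction to the total time.)
NMI watchdog disabled: 5 events not including cycles or instructions - multiplexing. (Confirming that cycles and instructions were using fixed counters.)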
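
The extrapolation mentioned above can be sketched as follows, assuming perf scales a raw count by the ratio of the time the event was enabled to the time it was actually running on a counter. The function names and the sample numbers are hypothetical.

```python
from typing import Optional


def scale_count(raw: int, time_enabled: int, time_running: int) -> Optional[float]:
    """Extrapolate a multiplexed count to the full enabled time.

    Returns None if the event never ran on a counter (nothing to scale).
    """
    if time_running == 0:
        return None
    return raw * time_enabled / time_running


def running_percent(time_enabled: int, time_running: int) -> float:
    """Share of the enabled time the event was running, as a percentage."""
    return 100.0 * time_running / time_enabled


# Made-up sample: the event was on a counter for half of the enabled time.
print(scale_count(500, time_enabled=1000, time_running=500))
print(running_percent(time_enabled=1000, time_running=500))
```

An event that never got a counter has nothing to extrapolate from, which fits the "<not counted>" line for LLC-load-misses in the question.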

To confirm the limit of 4 arbitrary events plus either of cycles, instructions, compared with the NMI watchdog enabled:

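NMI watchdog enabled: 4 events not including cycles or instructions - no multiplexing, confirming that the NMI watchdog leaves all 4 programmable counters free.
NMI watchdog enabled: 4 events plus cycles - multiplexing, confirming that cycles now has to use a programmable counter, implying the NMI watchdog used that fixed counter.
NMI watchdog enabled: cycles + instructions + 3 other events - no multiplexing, as we'd expect. Further confirmation that cycles becomes one of the events competing for programmable counters.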
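
These observations fit a very simple counting model, sketched below. It assumes that only cycles and instructions can go to fixed counters, that the NMI watchdog takes the fixed counter cycles would otherwise use, and that every other hardware event needs one programmable counter. Real scheduling in the kernel also has per-event counter constraints, which this toy model ignores; the event names other than cycles and instructions are placeholders.

```python
FIXED_EVENTS = ("cycles", "instructions")


def needs_multiplexing(events, nmi_watchdog: bool, gp_counters: int = 4) -> bool:
    """Toy model: does this set of hardware events exceed the available counters?"""
    free_fixed = set(FIXED_EVENTS)
    if nmi_watchdog:
        # The watchdog holds the fixed counter that cycles would normally use.
        free_fixed.discard("cycles")
    gp_needed = sum(1 for event in events if event not in free_fixed)
    return gp_needed > gp_counters


others = ["ev1", "ev2", "ev3", "ev4", "ev5"]  # placeholder event names

# Watchdog disabled
print(needs_multiplexing(["cycles", "instructions"] + others[:4], False))
print(needs_multiplexing(["cycles", "instructions"] + others[:5], False))
print(needs_multiplexing(others[:5], False))

# Watchdog enabled
print(needs_multiplexing(others[:4], True))
print(needs_multiplexing(["cycles"] + others[:4], True))
print(needs_multiplexing(["cycles", "instructions"] + others[:3], True))
```

On the Broadwell machine from the question, gp_counters would be 8 with HT disabled in BIOS, so the model predicts no multiplexing there; the zero count comes from the erratum, not from a shortage of counters.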

This is all the same whether I use perf stat --all-user or cycles:u.

For example (with some horizontal whitespace removed for SO's narrow code blocks):

# with NMI watchdog enabled
$ taskset -c 0 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,idq.dsb_uops    -r1 ./a.out

 Performance counter stats for './a.out':

             40.74 msec task-clock            #    0.994 CPUs utilized          
                 0      context-switches      #    0.000 /sec                   
                 0      cpu-migrations        #    0.000 /sec                   
               119      page-faults           #    2.921 K/sec                  
       165,566,262      cycles                #    4.064 GHz      (61.39%)
       160,597,987      instructions          #    0.97  insn per cycle (83.46%)
       286,675,168      uops_issued.any       #    7.036 G/sec       (85.28%)
       286,258,415      uops_executed.thread  #    7.026 G/sec       (85.28%)
        76,619,024      idq.mite_uops         #    1.881 G/sec       (85.28%)
        77,238,565      idq.dsb_uops          #    1.896 G/sec       (82.77%)

       0.040990242 seconds time elapsed

       0.040912000 seconds user
       0.000000000 seconds sys

$ echo 0 | sudo tee  /proc/sys/kernel/nmi_watchdog
0
$ taskset -c 0 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,idq.dsb_uops    -r1 ./a.out

 Performance counter stats for './a.out':

             45.01 msec task-clock            #    0.992 CPUs utilized          
                 0      context-switches      #    0.000 /sec                   
                 0      cpu-migrations        #    0.000 /sec                   
               120      page-faults           #    2.666 K/sec                  
       177,494,136      cycles                #    3.943 GHz                    
       160,265,384      instructions          #    0.90  insn per cycle         
       287,253,352      uops_issued.any       #    6.382 G/sec                  
       286,705,343      uops_executed.thread  #    6.369 G/sec                  
        78,189,827      idq.mite_uops         #    1.737 G/sec                  
        75,911,530      idq.dsb_uops          #    1.686 G/sec                  

       0.045389998 seconds time elapsed

       0.045165000 seconds user
       0.000000000 seconds sys

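https://perfmon-events.intel.com/broadwell_server.html says there is a third fixed counter, CPU_CLK_UNHALTED.REF_TSC. So it is separate from the counters that count INST_RETIRED.ANY (counter #0) or CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.THREAD_ANY (counter #1).

ref_tsc is a fixed frequency, not core clock cycles; it would probably be better if the NMI watchdog could use that counter, since I expect it to be much less widely used. The cycles event is CPU_CLK_UNHALTED.THREAD on Intel CPUs, counting core clock cycles while this logical core is active. Perf counts it by default.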
