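On modern x86 CPUs, hardware prefetching is an important technique for bringing cache lines into various levels of the cache hierarchy before they are explicitly requested by user code.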
The basic idea is that when the processor detects a series of accesses to sequential or strided-sequential¹ locations, it will go ahead and fetch further memory locations in the sequence, even before executing the instructions that (may) actually access those locations.
My question is whether the detection of a prefetch sequence is based on the full addresses (the actual addresses requested by user code) or on the cache line addresses, which are essentially the full addresses with the bottom 6 bits² stripped off.
For example, on a system with 64-byte cache lines, accesses to full addresses 1, 2, 3, 65, 150 will access cache lines 0, 0, 0, 1, 2.
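To make the mapping concrete, here is a minimal sketch of how a full address reduces to a cache line index, assuming 64-byte lines. The names `LINE_SIZE`, `LINE_SHIFT`, and `cache_line` are made up for illustration; this only models the arithmetic, not any real hardware.

```python
LINE_SIZE = 64                           # bytes per cache line (assumed)
LINE_SHIFT = LINE_SIZE.bit_length() - 1  # 6 low bits select the byte within a line


def cache_line(addr: int) -> int:
    """Drop the low offset bits, leaving the cache line index."""
    return addr >> LINE_SHIFT


full_addresses = [1, 2, 3, 65, 150]
lines = [cache_line(a) for a in full_addresses]
print(lines)  # [0, 0, 0, 1, 2]
```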
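The difference could be relevant when a series of accesses is more regular in cache line addressing than in full addressing. For example, a series of full addresses like: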
32, 24, 8, 0, 64 + 32, 64 + 24, 64 + 8, 64 + 0, ..., N*64 + 32, N*64 + 24, N*64 + 8, N*64 + 0
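might not look like a strided sequence at the full address level (indeed, it might incorrectly trigger the backwards prefetcher, since each subsequence of 4 accesses looks like an 8-byte strided reverse sequence), but at the cache line level it looks like it is going forward one cache line at a time (just like the simple sequence 0, 8, 16, 24, ...).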
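The contrast between the two views can be sketched by computing the deltas of this sequence at both granularities. This is only an illustration of the address arithmetic, assuming 64-byte lines; the helper names (`access_sequence`, `deltas`, `distinct_lines`) are hypothetical and say nothing about how a real prefetcher tracks streams.

```python
LINE_SHIFT = 6  # assumes 64-byte cache lines


def access_sequence(n_lines):
    """Build 32, 24, 8, 0, 64+32, 64+24, ... for the first n_lines lines."""
    offsets = (32, 24, 8, 0)
    return [n * 64 + off for n in range(n_lines) for off in offsets]


def deltas(seq):
    """Differences between consecutive elements."""
    return [b - a for a, b in zip(seq, seq[1:])]


def distinct_lines(seq):
    """Map addresses to line indices, collapsing consecutive repeats."""
    out = []
    for addr in seq:
        line = addr >> LINE_SHIFT
        if not out or out[-1] != line:
            out.append(line)
    return out


addrs = access_sequence(3)
print(deltas(addrs))                  # mixed negative steps, then a +96 jump
print(deltas(distinct_lines(addrs)))  # [1, 1]: one line forward each time
```

At the full-address level the deltas are irregular (mostly negative, with a large positive jump every fourth access), while at the line level the stream is a plain +1 sequence.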
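Which system, if any, is in place on modern hardware?
Note: One could also imagine that the answer is not based on every access, but only on accesses that miss in some level of the cache that the prefetcher is observing. The same question still applies to that filtered stream of "miss accesses".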
¹ Strided-sequential just means accesses that have the same stride (delta) between them, even if that delta isn't 1. For example, a series of accesses to locations 100, 200, 300, ... could be detected as strided access with a stride of 100, and in principle the CPU will fetch based on this pattern (which would mean that some cache lines might be "skipped" in the prefetch pattern).
² Here assuming a 64-byte cache line.
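As a rough illustration of the first footnote, the toy model below detects a constant stride, extrapolates it, and lists which cache lines the resulting pattern never touches. It assumes 64-byte lines, and `detect_stride` and `predict` are invented names; real prefetchers are not documented to work this way.

```python
LINE_SHIFT = 6  # assumes 64-byte cache lines


def detect_stride(addrs):
    """Return the common delta if every consecutive delta matches, else None."""
    ds = {b - a for a, b in zip(addrs, addrs[1:])}
    return ds.pop() if len(ds) == 1 else None


def predict(addrs, count):
    """Extrapolate the next `count` addresses along the detected stride."""
    stride = detect_stride(addrs)
    if stride is None:
        return []
    return [addrs[-1] + stride * i for i in range(1, count + 1)]


observed = [100, 200, 300]
predicted = predict(observed, 3)
touched = sorted({a >> LINE_SHIFT for a in observed + predicted})
skipped = [n for n in range(touched[0], touched[-1] + 1) if n not in touched]

print(predicted)  # [400, 500, 600]
print(touched)    # [1, 3, 4, 6, 7, 9]
print(skipped)    # [2, 5, 8]
```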