I have implemented a simple linear-probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. Since key and value are 64-bit integers but the flag only needs a single byte, the compiler pads the struct by default (sizeof is 24 instead of 17). Therefore, I also tried packing the struct, at the cost of unaligned accesses. Because of the higher memory density (we do not waste bandwidth on transferring padding bytes), I expected better performance from the packed/unaligned version.
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I do find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors of 25% and 50%, with the successful-query rate on the x axis (0, 25, 50, 75, 100): As you can see, on Intel, the grey and blue (circle and square) lines almost overlap; the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we get more throughput.
To investigate this, I tried to build a smaller microbenchmark. It performs a similar benchmark, but without the hash-map lookup logic (i.e., we just pick random indices into the array and advance a few entries from there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
  asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;

  PaddedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = false;
  }
};
struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;

  PackedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
} __attribute__((__packed__));
int main() {
  const uint64_t size = 134217728;
  uint16_t repetitions = 0;
  uint16_t advancement = 0;

  std::cin >> repetitions;
  std::cout << "Got " << repetitions << std::endl;
  std::cin >> advancement;
  std::cout << "Got " << advancement << std::endl;
  std::cout << "Initializing." << std::endl;

  std::vector<PaddedStruct> padded(size);
  std::vector<PackedStruct> unaligned(size);
  std::vector<uint64_t> queries(size);

  // Initialize the structs with random values + prefault the pages
  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
  std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
  for (uint64_t i = 0; i < padded.size(); ++i) {
    padded[i].key = dist(gen);
    padded[i].value = dist(gen);
    padded[i].is_valid = 1;
  }
  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    unaligned[i].key = padded[i].key;
    unaligned[i].value = padded[i].value;
    unaligned[i].is_valid = 1;
  }
  for (uint64_t i = 0; i < queries.size(); ++i) {
    queries[i] = dist2(gen);
  }

  std::cout << "Running benchmark." << std::endl;

  ClobberMemory();
  auto start_padded = std::chrono::high_resolution_clock::now();
  PaddedStruct* padded_ptr = nullptr;
  uint64_t sum = 0;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        padded_ptr = &padded[query + i];
        if (padded_ptr->is_valid) [[likely]] {
          sum += padded_ptr->value;
        }
      }
      doNotOptimize(sum);
    }
  }
  ClobberMemory();
  auto end_padded = std::chrono::high_resolution_clock::now();
  uint64_t padded_runtime = static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
  // Print sum so that it does not get optimized out
  std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl;

  ClobberMemory();
  auto start_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t sum2 = 0;
  PackedStruct* packed_ptr = nullptr;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        packed_ptr = &unaligned[query + i];
        if (packed_ptr->is_valid) [[likely]] {
          sum2 += packed_ptr->value;
        }
      }
      doNotOptimize(sum2);
    }
  }
  ClobberMemory();
  auto end_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t unaligned_runtime = static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
  std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
For running the benchmark, I chose repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press enter), then enter 5 and press enter/newline. I updated the source code to (a) avoid the compiler unrolling the loops, which happens when repetitions/advancement are hardcoded, and (b) switch to pointers into the vector, as that more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while Intel still benefits slightly from the unaligned accesses in this microbenchmark, the improvement on the AMD CPU is higher, both in absolute and relative terms. I cannot explain this. In general, from what I have learned from related SO threads, an unaligned access to a single member is just as expensive as an aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' mail, I could not find any other documentation on cache fetch widths in processors, and especially not for my two specific CPUs, to figure out whether this might be related.
Does anyone have an idea why the AMD CPU benefits more from struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effect on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the difference here.
Thank you very much!
(1) Related SO thread: How can I accurately benchmark unaligned access speed on x86_64? https://stackoverflow.com/questions/45128763/how-can-i-accurately-benchmark-unaligned-access-speed-on-x86-64
(2)