The following few lines of code
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );
unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}
take 11130 usecs to run on my i5-3230M when compiled with
g++ -o main main.cpp -std=c++0x -O3
that is, when the OpenMP pragmas are ignored.
On the other hand, they take only 1496 usecs when compiled with
g++ -o main main.cpp -std=c++0x -O3 -fopenmp
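That is more than 6 times faster, which is quite surprising given that it runs on a 2-core machine. In fact, I have also tested with num_threads(1), and the performance improvement is still significant (more than 3 times faster).
Can anybody help me understand this behaviour?
EDIT: following the suggestions, here is the full code: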
#include <stdlib.h>
#include <iostream>
#include <chrono>
#include <cassert>

int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;

// Increments every byte of buff once, walking it row by row.
void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}

int main()
{
    // alloc & initialization
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for(size_t k=0; k<numel; k++)
        buff[k] = 0;

    // time 100 calls and report the average
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();

    for(int k=0; k<100; k++)
        func();

    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;

    free(buff);
    return 0;
}
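One experiment that might help narrow this down (a sketch under assumptions, not a confirmed explanation): firstprivate gives the loop its own copies of nrows and ncols, whereas the build without -fopenmp reads the globals directly. Since an unsigned char * is allowed to alias any object, the compiler may have to assume that the store through pbuff can change ncols. The variant below copies the bounds into locals by hand, with no OpenMP involved. The name func_local_bounds is hypothetical and is not part of the code above.

```cpp
#include <cassert>

// Globals, as in the full listing above.
int nrows = 4096;
int ncols = 4096;
unsigned char *buff = nullptr;

// Hypothetical variant of func(): same traversal, but the loop bounds are
// copied into locals first, roughly mimicking what firstprivate provides.
void func_local_bounds()
{
    assert(buff != nullptr);
    const int rows = nrows;  // local copy; a store through pbuff cannot modify it
    const int cols = ncols;  // local copy of the inner loop bound
    unsigned char *pbuff = buff;
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}
```

Calling func_local_bounds() in place of func() in main and compiling without -fopenmp would show whether the timing moves toward the 1496 usecs figure. If it does, that points at the global loop bounds rather than at threading; if it does not, this guess is wrong.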