答案是第一个块:
__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v,two_v);
_mm_storeu_ps(&b[i],ai2_v);
它已经一次需要四个变量。
这是完整的程序,其中注释掉了等效的代码部分:
#include <iostream>
int main()
{
int i{0};
float a[10] ={1,2,3,4,5,6,7,8,9,10};
float b[10] ={0,0,0,0,0,0,0,0,0,0};
int n = 10;
int unroll = (n/4)*4;
for (i=0; i<unroll; i+=4) {
//b[i] = a[i]*2;
//b[i+1] = a[i+1]*2;
//b[i+2] = a[i+2]*2;
//b[i+3] = a[i+3]*2;
__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v,two_v);
_mm_storeu_ps(&b[i],ai2_v);
}
for (; i<n; i++) {
b[i] = a[i]*2;
}
for (auto i : a) { std::cout << i << "\t"; }
std::cout << "\n";
for (auto i : b) { std::cout << i << "\t"; }
std::cout << "\n";
return 0;
}
至于效率;看来我的系统上的程序集生成了movups
说明,而手卷代码可以使用movaps
这应该更快。
我使用以下程序进行了一些基准测试:
#include <iostream>
//#define NO_UNROLL
//#define UNROLL
//#define SSE_UNROLL
#define SSE_UNROLL_ALIGNED
int main()
{
const size_t array_size = 100003;
#ifdef SSE_UNROLL_ALIGNED
__declspec(align(16)) int i{0};
__declspec(align(16)) float a[array_size] ={1,2,3,4,5,6,7,8,9,10};
__declspec(align(16)) float b[array_size] ={0,0,0,0,0,0,0,0,0,0};
#endif
#ifndef SSE_UNROLL_ALIGNED
int i{0};
float a[array_size] ={1,2,3,4,5,6,7,8,9,10};
float b[array_size] ={0,0,0,0,0,0,0,0,0,0};
#endif
int n = array_size;
int unroll = (n/4)*4;
for (size_t j{0}; j < 100000; ++j) {
#ifdef NO_UNROLL
for (i=0; i<n; i++) {
b[i] = a[i]*2;
}
#endif
#ifdef UNROLL
for (i=0; i<unroll; i+=4) {
b[i] = a[i]*2;
b[i+1] = a[i+1]*2;
b[i+2] = a[i+2]*2;
b[i+3] = a[i+3]*2;
}
#endif
#ifdef SSE_UNROLL
for (i=0; i<unroll; i+=4) {
__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v,two_v);
_mm_storeu_ps(&b[i],ai2_v);
}
#endif
#ifdef SSE_UNROLL_ALIGNED
for (i=0; i<unroll; i+=4) {
__m128 ai_v = _mm_load_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v,two_v);
_mm_store_ps(&b[i],ai2_v);
}
#endif
#ifndef NO_UNROLL
for (; i<n; i++) {
b[i] = a[i]*2;
}
#endif
}
//for (auto i : a) { std::cout << i << "\t"; }
//std::cout << "\n";
//for (auto i : b) { std::cout << i << "\t"; }
//std::cout << "\n";
return 0;
}
我得到以下结果(x86):
-
NO_UNROLL
: 0.994秒,编译器没有选择 SSE
-
UNROLL
:3.511秒,使用movups
-
SSE_UNROLL
: 3.315秒,使用movups
-
SSE_UNROLL_ALIGNED
: 3.276秒,使用movaps
因此很明显,在这种情况下展开循环并没有帮助。甚至确保我们使用更高效的movaps
没有多大帮助。
但当编译为 64 位 (x64) 时,我得到了一个更奇怪的结果:
-
NO_UNROLL
: 1.138秒,编译器没有选择 SSE
-
UNROLL
:1.409秒,编译器没有选择 SSE
-
SSE_UNROLL
:1.420秒,编译器仍然没有选择 SSE!
-
SSE_UNROLL_ALIGNED
: 1.476秒,编译器仍然没有选择 SSE!
看来 MSVC 已经看穿了该提案并生成了更好的装配,尽管仍然比我们根本没有尝试任何手动优化的情况慢。