问题介绍
我正在尝试加快我正在编写的(2d)光线追踪器的相交代码。我使用 C# 和 System.Numerics 库来提高 SIMD 指令的速度。
问题是我得到了奇怪的结果,有超顶加速和相当低的加速。我的问题是,为什么一个是在屋顶之上,而另一个是相当低的?
Context:
- RayPack 结构是一系列(不同的)光线,包装在 System.Numerics 的向量中。
- BoundingBoxPack 和 CirclePack 结构是单个 bb / 圆,包装在 System.Numerics 向量中。
- 使用的 CPU 是 i7-4710HQ (Haswell),具有 SSE 4.2、AVX(2) 和 FMA(3) 指令。
- 在发布模式下运行(64 位)。该项目在 .Net Framework 472 中运行。未设置其他选项。
Attempts
我已经尝试查找某些操作是否可能得到正确支持(请注意:这些操作适用于 c++)。https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ or http://sci.tuomastonteri.fi/programming/sse http://sci.tuomastonteri.fi/programming/sse),但情况似乎并非如此,因为我使用的笔记本电脑支持 SSE 4.2。
在当前代码中,应用了以下更改:
- 使用更正确的指令(例如,包装分钟)。
- 没有使用float * vector指令(会导致很多额外的操作,参见原文的汇编)。
代码...片段?
对于大量的代码表示歉意,但我不确定在没有这么多代码的情况下我们如何具体讨论这个问题。
Ray 代码 -> BoundingBox
public bool CollidesWith(Ray ray, out float t)
{
// https://gamedev.stackexchange.com/questions/18436/most-efficient-aabb-vs-ray-collision-algorithms
// r.dir is unit direction vector of ray
float dirfracx = 1.0f / ray.direction.X;
float dirfracy = 1.0f / ray.direction.Y;
// lb is the corner of AABB with minimal coordinates - left bottom, rt is maximal corner
// r.org is origin of ray
float t1 = (this.rx.min - ray.origin.X) * dirfracx;
float t2 = (this.rx.max - ray.origin.X) * dirfracx;
float t3 = (this.ry.min - ray.origin.Y) * dirfracy;
float t4 = (this.ry.max - ray.origin.Y) * dirfracy;
float tmin = Math.Max(Math.Min(t1, t2), Math.Min(t3, t4));
float tmax = Math.Min(Math.Max(t1, t2), Math.Max(t3, t4));
// if tmax < 0, ray (line) is intersecting AABB, but the whole AABB is behind us
if (tmax < 0)
{
t = tmax;
return false;
}
// if tmin > tmax, ray doesn't intersect AABB
if (tmin > tmax)
{
t = tmax;
return false;
}
t = tmin;
return true;
}
RayPak 代码 -> Bounding Box Pack
public Vector<int> CollidesWith(ref RayPack ray, out Vector<float> t)
{
// ------------------------------------------------------- \\
// compute the collision. \\
Vector<float> dirfracx = Constants.ones / ray.direction.x;
Vector<float> dirfracy = Constants.ones / ray.direction.y;
Vector<float> t1 = (this.rx.min - ray.origin.x) * dirfracx;
Vector<float> t2 = (this.rx.max - ray.origin.x) * dirfracx;
Vector<float> t3 = (this.ry.min - ray.origin.y) * dirfracy;
Vector<float> t4 = (this.ry.max - ray.origin.y) * dirfracy;
Vector<float> tmin = Vector.Max(Vector.Min(t1, t2), Vector.Min(t3, t4));
Vector<float> tmax = Vector.Min(Vector.Max(t1, t2), Vector.Max(t3, t4));
Vector<int> lessThanZeroMask = Vector.GreaterThan(tmax, Constants.zeros);
Vector<int> greaterMask = Vector.LessThan(tmin, tmax);
Vector<int> combinedMask = Vector.BitwiseOr(lessThanZeroMask, greaterMask);
// ------------------------------------------------------- \\
// Keep track of the t's that collided. \\
t = Vector.ConditionalSelect(combinedMask, tmin, Constants.floatMax);
return combinedMask;
}
射线代码 -> 圆
public bool Intersect(Circle other)
{
// Step 0: Work everything out on paper!
// Step 1: Gather all the relevant data.
float ox = this.origin.X;
float dx = this.direction.X;
float oy = this.origin.Y;
float dy = this.direction.Y;
float x0 = other.origin.X;
float y0 = other.origin.Y;
float cr = other.radius;
// Step 2: compute the substitutions.
float p = ox - x0;
float q = oy - y0;
float r = 2 * p * dx;
float s = 2 * q * dy;
// Step 3: compute the substitutions, check if there is a collision.
float a = dx * dx + dy * dy;
float b = r + s;
float c = p * p + q * q - cr * cr;
float DSqrt = b * b - 4 * a * c;
// no collision possible! Commented out to make the benchmark more fair
//if (DSqrt < 0)
//{ return false; }
// Step 4: compute the substitutions.
float D = (float)Math.Sqrt(DSqrt);
float t0 = (-b + D) / (2 * a);
float t1 = (-b - D) / (2 * a);
float ti = Math.Min(t0, t1);
if(ti > 0 && ti < t)
{
t = ti;
return true;
}
return false;
}
RayPack -> CirclePack 代码
原始的、未经编辑的代码可以在以下位置找到:https://pastebin.com/87nYgZrv https://pastebin.com/87nYgZrv
public Vector<int> Intersect(CirclePack other)
{
// ------------------------------------------------------- \\
// Put all the data on the stack. \\
Vector<float> zeros = Constants.zeros;
Vector<float> twos = Constants.twos;
Vector<float> fours = Constants.fours;
// ------------------------------------------------------- \\
// Compute whether the ray collides with the circle. This \\
// is stored in the 'mask' vector. \\
Vector<float> p = this.origin.x - other.origin.x; ;
Vector<float> q = this.origin.y - other.origin.y;
Vector<float> r = twos * p * this.direction.x;
Vector<float> s = twos * q * this.direction.y; ;
Vector<float> a = this.direction.x * this.direction.x + this.direction.y * this.direction.y;
Vector<float> b = r + s;
Vector<float> c = p * p + q * q - other.radius * other.radius;
Vector<float> DSqrt = b * b - fours * a * c;
Vector<int> maskCollision = Vector.GreaterThan(DSqrt, zeros);
// Commented out to make the benchmark more fair.
//if (Vector.Dot(maskCollision, maskCollision) == 0)
//{ return maskCollision; }
// ------------------------------------------------------- \\
// Update t if and only if there is a collision. Take \\
// note of the conditional where we compute t. \\
Vector<float> D = Vector.SquareRoot(DSqrt);
Vector<float> bMinus = Vector.Negate(b);
Vector<float> twoA = twos * a;
Vector<float> t0 = (bMinus + D) / twoA;
Vector<float> t1 = (bMinus - D) / twoA;
Vector<float> tm = Vector.ConditionalSelect(Vector.LessThan(t1, t0), t1, t0);
Vector<int> maskBiggerThanZero = Vector.GreaterThan(tm, zeros);
Vector<int> maskSmallerThanT = Vector.LessThan(tm, this.t);
Vector<int> mask = Vector.BitwiseAnd(
maskCollision,
Vector.BitwiseAnd(
maskBiggerThanZero,
maskSmallerThanT)
);
this.t = Vector.ConditionalSelect(
mask, // the bit mask that allows us to choose.
tm, // the smallest of the t's.
t); // if the bit mask is false (0), then we get our original t.
return mask;
}
汇编代码
这些可以在pastebin上找到。请注意,基准测试工具中有一些样板组件。您需要查看函数调用。
- 边界框(包):https://pastebin.com/RYMQdZMh https://pastebin.com/RYMQdZMh
- 圆(包)调整:https://pastebin.com/YZHjc1vY https://pastebin.com/YZHjc1vY
- 圆(包)原图:https://pastebin.com/87nYgZrv https://pastebin.com/87nYgZrv
标杆管理
我一直在使用 BenchmarkDotNet 对情况进行基准测试。
Circle / CirclePack 的结果(更新):
Method |
Mean |
Error |
StdDev |
Intersection |
9.710 ms |
0.0540 ms |
0.0505 ms |
IntersectionPacked |
3.296 ms |
0.0055 ms |
0.0051 ms |
边界框/边界框填充的结果:
Method |
Mean |
Error |
StdDev |
Intersection |
24.269 ms |
0.2663 ms |
0.2491 ms |
IntersectionPacked |
1.152 ms |
0.0229 ms |
0.0264 ms |
由于 AVX,预计加速大约为 6 倍到 8 倍。边界框的加速比为重要的,而圆的加速比相当低。
重新审视顶部的问题:为什么一个加速比非常高,而另一个则相当低?两者中较低的一个(CirclePack)如何变得更快?
关于 Peter Cordes 的编辑(评论)
- 使基准测试更加公平:一旦光线不再发生碰撞,单光线版本就不会提前分支。现在加速大约是 2.5 倍。
- 添加汇编代码作为单独的标头。
- 关于平方根:这确实有影响,但没有看起来那么大。去除向量平方根可将总时间减少约 0.3ms。单射线代码现在也始终执行平方根。
- 关于 C# 中的 FMA(融合乘加)的问题。我认为它适用于标量(参见C# 可以使用乘加融合吗? https://stackoverflow.com/questions/37443569/can-c-sharp-make-use-of-fused-multiply-add),但我没有在 System.Numerics.Vector 结构中找到类似的操作。
- 关于压缩 min 的 C# 指令:是的,确实如此。愚蠢的我。我什至已经用过它了。