After evaluating various approaches, the most suitable one is the algorithm proposed in the following paper:
M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of (1 + 2x) exp(x²) erfc x in 0 ≤ x < ∞," Mathematics of Computation, Vol. 36, No. 153, January 1981, pp. 249-253
The basic idea of the paper is to create an approximation to (1 + 2x) exp(x²) erfc(x), from which we can compute erfcx(x) by simply dividing by (1 + 2x), and erfc(x) by then multiplying with exp(-x²). The tightly bounded range of the function, with function values roughly in [1, 1.3], and its general "flatness" lend themselves well to polynomial approximation. The numerical properties of this approach are further improved by narrowing the approximation interval: the original argument x is transformed by q = (x - K) / (x + K), where K is a suitably chosen constant, followed by computing p(q), where p is a polynomial.
Since erfc(-x) = 2 - erfc(x), we need only consider the interval [0, ∞], which is mapped to the interval [-1, 1] by this transformation. For IEEE-754 single precision, erfcf() vanishes (becomes zero) for x > 10.0546875, so we need only consider x ∈ [0, 10.0546875]. What is the "best" value of K for this range? I know of no mathematical analysis that would provide the answer; the paper suggests K = 3.75 based on experiments.
One readily establishes that for single-precision computation, a minimax polynomial approximation of degree 9 is sufficient for various values of K in that general vicinity. Systematically generating such approximations with the Remez algorithm, with K varied between 1.5 and 4 in steps of 1/16, the lowest approximation error is observed for K = {2, 2.625, 3.3125}. Of these, K = 2 is the most advantageous choice, since it lends itself to very accurate computation of (x - K) / (x + K), as shown in this question.
The value K = 2 and the input domain of x would suggest using variant 4 from my answer to that question; however, it can be demonstrated experimentally that the cheaper variant 5 achieves the same accuracy here. This is likely due to the very shallow slope of the approximated function for q > -0.5, which causes any error in the argument q to be attenuated by roughly a factor of ten.
Since the computation of erfc() requires post-processing steps in addition to the initial approximation, it is clear that both of these computations must be performed with high accuracy in order to achieve a sufficiently accurate final result; error-correction techniques must be used.
One observes that the most significant coefficient in the polynomial approximation of (1 + 2x) exp(x²) erfc(x) is of the form (1 + s), where s < 0.5. This means we can represent the leading coefficient more accurately by splitting off the 1 and using only s in the polynomial. So instead of computing a polynomial p(q) and then multiplying by the reciprocal r = 1 / (1 + 2x), it is mathematically equivalent but numerically advantageous to compute the core approximation as p(q) + 1, and use p to compute fmaf (p, r, r).
The accuracy of the division can be improved by computing an initial quotient q from the reciprocal r, computing the residual e = p + 1 - q * (1 + 2x) with the help of an FMA, then applying the correction q = q + (e * r), again using an FMA.
Exponentiation has error-magnifying properties, therefore the computation of e^(-x²) must be performed carefully. The availability of FMA trivially allows the computation of -x² as a double-float s_high:s_low. e^x is its own derivative, so one can compute e^(s_high:s_low) as e^(s_high) + e^(s_high) · s_low. This computation can be combined with the multiplication by the previous intermediate result r to yield r = r · e^(s_high) + r · e^(s_high) · s_low. By use of FMA, one ensures that the most significant term, r · e^(s_high), is computed as accurately as possible.
Combining these steps with a few simple choices for the handling of exceptional cases and negative arguments results in the following C code:
#include <math.h>

float my_expf (float);
/* Compute complementary error function.
*
* Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of
* (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
* No. 153, January 1981, pp. 249-253.
*
* maximum error: 2.65184 ulps
*/
float my_erfcf (float x)
{
float a, d, e, p, q, r, s, t;
a = fabsf (x);
/* Compute q = (a-2)/(a+2) accurately. [0, 10.0546875] -> [-1, 0.66818] */
p = a + 2.0f;
r = 1.0f / p;
q = fmaf (-4.0f, r, 1.0f);
t = fmaf (q + 1.0f, -2.0f, a);
e = fmaf (-a, q, t);
q = fmaf (r, e, q);
/* Approximate (1+2*a)*exp(a*a)*erfc(a) as p(q)+1 for q in [-1, 0.66818] */
p = -0x1.a4a000p-12f; // -4.01139259e-4
p = fmaf (p, q, -0x1.42a260p-10f); // -1.23075210e-3
p = fmaf (p, q, 0x1.585714p-10f); // 1.31355342e-3
p = fmaf (p, q, 0x1.1adcc4p-07f); // 8.63227434e-3
p = fmaf (p, q, -0x1.081b82p-07f); // -8.05991981e-3
p = fmaf (p, q, -0x1.bc0b6ap-05f); // -5.42046614e-2
p = fmaf (p, q, 0x1.4ffc46p-03f); // 1.64055392e-1
p = fmaf (p, q, -0x1.540840p-03f); // -1.66031361e-1
p = fmaf (p, q, -0x1.7bf616p-04f); // -9.27639827e-2
p = fmaf (p, q, 0x1.1ba03ap-02f); // 2.76978403e-1
/* Divide (1+p) by (1+2*a) ==> exp(a*a)*erfc(a) */
d = fmaf (2.0f, a, 1.0f);
r = 1.0f / d;
q = fmaf (p, r, r); // q = (p+1)/(1+2*a)
e = fmaf (fmaf (q, -a, 0.5f), 2.0f, p - q); // residual: (p+1)-q*(1+2*a)
r = fmaf (e, r, q);
/* Multiply by exp(-a*a) ==> erfc(a) */
s = a * a;
e = my_expf (-s);
t = fmaf (-a, a, s);
r = fmaf (r, e, r * e * t);
/* Handle NaN, Inf arguments to erfc() */
if (!(a < INFINITY)) r = x + x;
/* Clamp result for large arguments */
if (a > 10.0546875f) r = 0.0f;
/* Handle negative arguments to erfc() */
if (x < 0.0f) r = 2.0f - r;
return r;
}
/* Compute exponential base e. Maximum ulp error = 0.86565 */
float my_expf (float a)
{
float c, f, r;
int i;
// exp(a) = exp(i + f); i = rint (a / log(2))
c = 0x1.800000p+23f; // 1.25829120e+7
r = fmaf (0x1.715476p+0f, a, c) - c; // 1.44269502e+0
f = fmaf (r, -0x1.62e400p-01f, a); // -6.93145752e-1 // log_2_hi
f = fmaf (r, -0x1.7f7d1cp-20f, f); // -1.42860677e-6 // log_2_lo
i = (int)r;
// approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
r = 0x1.694000p-10f; // 1.37805939e-3
r = fmaf (r, f, 0x1.125edcp-07f); // 8.37312452e-3
r = fmaf (r, f, 0x1.555b5ap-05f); // 4.16695364e-2
r = fmaf (r, f, 0x1.555450p-03f); // 1.66664720e-1
r = fmaf (r, f, 0x1.fffff6p-02f); // 4.99999851e-1
r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
// exp(a) = 2**i * exp(f);
r = ldexpf (r, i);
if (!(fabsf (a) < 104.0f)) {
r = a + a; // handle NaNs
if (a < 0.0f) r = 0.0f;
if (a > 0.0f) r = INFINITY;
}
return r;
}
I used my own implementation of expf() in the code above in order to insulate this work from differences between expf() implementations on different computing platforms. However, any implementation of expf() whose maximum error is close to 0.5 ulp should work well. As shown above, i.e. when using my_expf(), my_erfcf() has a maximum error of 2.65184 ulps.
Provided a vectorizable expf() implementation is available, the code above should vectorize without problems. I did a quick check with the Intel compiler 13.1.3.198: I called my_erfcf() in a loop, added #include <mathimf.h>, replaced the call to my_expf() with a call to expf(), then compiled using these command-line switches:
/Qstd=c99 /O3 /QxCORE-AVX2 /fp:precise /Qfma /Qimf-precision:high:expf /Qvec_report=2
The Intel compiler reported that the loop had been vectorized, which I double-checked by inspecting the disassembled binary code.
Since my_erfcf() uses only reciprocals rather than full divisions, it is amenable to fast reciprocal implementations, provided they deliver almost correctly rounded results. For processors that provide a fast single-precision reciprocal approximation in hardware, this can easily be achieved by coupling it with a Halley iteration with cubic convergence. A (scalar) example of this approach for x86 processors is:
#include <math.h>
#include <immintrin.h>

/* Compute 1.0f / a almost correctly rounded. Halley iteration with cubic convergence */
float fast_recipf (float a)
{
__m128 t;
float e, r;
t = _mm_set_ss (a);
t = _mm_rcp_ss (t);
_mm_store_ss (&r, t);
e = fmaf (r, -a, 1.0f);
e = fmaf (e, e, e);
r = fmaf (e, r, r);
return r;
}