

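In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Large numbers will therefore change much more in absolute terms than small numbers, but the change will tend to be the same proportionately. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first so that the result rounds up.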



  1. Find: zero-based index of the most significant binary digit set H.
  2. Compute: B = H - P, 
       where P is the number of significant digits of precision to round
       and B is the binary digit to start rounding, where B = 0 is the ones place, 
       B = 1 is the twos place, etc. 
  3. Add: x = x + 2^B 
       This will force a carry if necessary (we round halfway values up).
  4. Zero out: x = x - (x mod 2^(B+1)). 
       This clears the B place and all lower digits.
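
For unsigned integers, the four steps above can be sketched as follows. This is only an illustration under some assumptions: the class and method names are hypothetical, `BitOperations.Log2` requires .NET Core 3.0 or later, `precision` is assumed to be at least 1, and values close to `ulong.MaxValue` would wrap around in step 3.

```csharp
using System;
using System.Numerics;

public static class IntegerRoundingSketch
{
    // Hypothetical helper: rounds x to 'precision' significant binary digits,
    // rounding halfway values up. Assumes precision >= 1.
    public static ulong RoundToSignificantBits(ulong x, int precision)
    {
        if (x == 0) return 0;

        int h = BitOperations.Log2(x);         // Step 1: index of the highest set bit (H).
        int b = h - precision;                 // Step 2: the bit where rounding starts (B).
        if (b < 0) return x;                   // x already fits in 'precision' bits.

        x += 1UL << b;                         // Step 3: add 2^B to force a carry if needed.
        ulong lowMask = (1UL << (b + 1)) - 1;  // Bits 0..B.
        return x & ~lowMask;                   // Step 4: clear the B place and all lower bits.
    }
}
```

With `precision` set to 3, this maps 9 to 10 and 15 to 16, and leaves 8 and 7 unchanged.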

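The problem is finding an efficient way to locate the highest bit set. If I were using integers, there are cool hacks for finding the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.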
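
One possible way to avoid the logarithm, sketched here as an assumption rather than a tested solution: for a normal, finite, nonzero double, the position of the highest set bit is the unbiased exponent, which can be read straight from the bit pattern with `BitConverter.DoubleToInt64Bits`. The class and method names are hypothetical.

```csharp
using System;

public static class ExponentSketch
{
    // Hypothetical helper: returns floor(log2(|x|)) for a normal, finite, nonzero double.
    // Zero and subnormals return -1023; infinities and NaN return 1024.
    public static int HighestBitIndex(double x)
    {
        long bits = BitConverter.DoubleToInt64Bits(x);
        int biasedExponent = (int)((bits >> 52) & 0x7FF); // 11-bit exponent field.
        return biasedExponent - 1023;                     // Remove the IEEE 754 bias.
    }
}
```

For example, `HighestBitIndex(8.0)` is 3 and `HighestBitIndex(0.75)` is -1.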


What is a good way to round double-precision values to a (somewhat) lower precision? https://stackoverflow.com/questions/14150136/what-is-a-good-way-to-round-double-precision-values-to-a-somewhat-lower-precis

That question is about C++. I am using C#.



/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 =>10000  16
///     16   10000 =>10000  16
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    // Note: masking truncates toward zero; it does not round to nearest.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}

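The same masking can probably be done without an unsafe block by swapping the pointer casts for `BitConverter.DoubleToInt64Bits` and `BitConverter.Int64BitsToDouble`. The sketch below is a variant under that assumption, not the code that was timed; the class and method names are hypothetical and `digits` is assumed to lie between 1 and 53.

```csharp
using System;

public static class SafeTruncationSketch
{
    // Hypothetical variant: keeps 'digits' significant bits and truncates the rest
    // toward zero. Assumes 1 <= digits <= 53.
    public static double TruncatePrecision(double d, int digits)
    {
        int shift = 53 - digits;                        // Number of low significand bits to drop.
        long mask = -1L << shift;                       // All ones except the low 'shift' bits.
        long bits = BitConverter.DoubleToInt64Bits(d);  // Reinterpret the double as 64 raw bits.
        return BitConverter.Int64BitsToDouble(bits & mask);
    }
}
```

For example, `TruncatePrecision(9.0, 3)` gives 8.0 and `TruncatePrecision(15.0, 3)` gives 14.0, which shows the truncation rather than rounding.
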
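Update 2: Dekker's algorithm

Thanks to the other respondent, I derived this from Dekker's algorithm. It rounds to the nearest value instead of truncating as the code above does, and it uses only safe code: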

private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    // Entry i holds 2^i + 1, the multiplier Dekker's algorithm needs to drop i low bits.
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}

Update 2: Timings

I ran a test and found that Dekker's algorithm is more than twice as fast!

Unsafe time = 1.922 (seconds)
Safe time = 0.799 (seconds)

Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

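This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic is okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).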

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}

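Since the question is about C#, here is a rough port of Split that also builds 2^b at run time by writing the exponent field directly. It is a sketch under assumptions: the names `DekkerSplitSketch` and `PowerOfTwo` are made up for illustration, b must lie between -1022 and 1023, and the arithmetic is assumed to be evaluated in plain double precision, as noted above.

```csharp
using System;

public static class DekkerSplitSketch
{
    // Hypothetical helper: builds 2^b exactly from its biased exponent.
    // Valid only for -1022 <= b <= 1023.
    public static double PowerOfTwo(int b)
    {
        return BitConverter.Int64BitsToDouble((long)(b + 1023) << 52);
    }

    // Splits x so that High keeps the top 53 - b significand bits and
    // High + Low == x. Assumes x * (2^b + 1) does not overflow.
    public static (double High, double Low) Split(double x, int b)
    {
        double scale = PowerOfTwo(b);
        double d = x * (scale + 1);
        double t = d - x;
        double high = d - t;
        double low = x - high;
        return (high, low);
    }
}
```

With b = 50 (three significant bits kept), `Split(19.0, 50)` gives a high part of 20, while the halfway cases show the ties-to-even behaviour: 9 goes to 8 and 11 goes to 12.
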
