I tried using Numba to see how fast a radix sort could be. The key to good performance with Numba is (often) to write out all the loops, which is very instructive. I ended up with the following:
from numba import jit

@jit
def radix_loop(nbatches, batch_m_bits, bitsums, a, out):
    mask = (1 << batch_m_bits) - 1
    for shift in range(0, nbatches*batch_m_bits, batch_m_bits):
        # set bit sums to zero
        for i in range(bitsums.shape[0]):
            bitsums[i] = 0

        # determine bit sums
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            bitsums[j] += 1

        # take the cumsum of the bit sums
        cumsum = 0
        for i in range(bitsums.shape[0]):
            temp = bitsums[i]
            bitsums[i] = cumsum
            cumsum += temp

        # sorting loop
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            out[bitsums[j]] = a[i]
            bitsums[j] += 1

        # prepare next iteration
        mask <<= batch_m_bits
        # can't use `temp` here because of Numba internal types
        temp2 = a
        a = out
        out = temp2

    return a
From the four inner loops, it's easy to see that it's the fourth one (the scatter in the sorting loop) that makes this hard to vectorize with NumPy.
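To make that concrete, here is a minimal sketch (my addition, not from the original answer) of one radix pass in NumPy terms: the first three loops map directly onto np.bincount and np.cumsum, but the fourth cannot be replaced by a fancy-indexing assignment, because writes to duplicate keys must accumulate:

import numpy as np

def numpy_radix_pass(a, out, shift, mask):
    keys = (a & mask) >> shift
    # loops 1+2: histogram of the current digit
    bitsums = np.bincount(keys, minlength=(mask >> shift) + 1)
    # loop 3: exclusive prefix sum gives each bucket's start offset
    offsets = np.concatenate(([0], np.cumsum(bitsums)[:-1]))
    # loop 4: inherently sequential scatter. Note that
    # `out[offsets[keys]] = a` would NOT work: NumPy fancy assignment
    # does not accumulate, so duplicate keys would overwrite each other.
    for i in range(a.size):
        j = keys[i]
        out[offsets[j]] = a[i]
        offsets[j] += 1
    return out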
One way to work around that problem is to enlist a dedicated C++ function from SciPy: scipy.sparse.coo.coo_tocsr (https://github.com/scipy/scipy/blob/maintenance/0.16.x/scipy/sparse/sparsetools/coo.h#L34-L78). Its inner loops are nearly identical to the Python function above, so it can be abused to write a faster "vectorized" radix sort in Python. Maybe something like:
import numpy as np
from scipy.sparse.coo import coo_tocsr

def radix_step(radix, keys, bitsums, a, w):
    # coo_tocsr groups COO entries by row; with `keys` as the row
    # indices and `a` as the data, it performs one stable
    # counting-sort pass, writing the reordered values into `w`.
    coo_tocsr(radix, 1, a.size, keys, a, a, bitsums, w, w)
    return w, a

def scipysparse_radix_perbyte(a):
    # coo_tocsr internally works with system int and upcasts
    # anything else. We need to copy anyway to not mess with
    # the original array. Also take endianness into account...
    a = a.astype('<i', copy=True)
    bitlen = int(a.max()).bit_length()
    radix = 256
    work = np.empty_like(a)
    _ = np.empty(radix+1, int)
    for i in range((bitlen-1)//8 + 1):
        keys = a.view('u1')[i::a.itemsize].astype(int)
        a, work = radix_step(radix, keys, _, a, work)
    return a
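As a quick sanity check (my addition, assuming the two functions above are in scope; note that recent SciPy releases have privatized this import path, e.g. the compiled routine lives in scipy.sparse._sparsetools, so the import may need adjusting):

x = np.random.randint(0, 10**8, 10**6)
assert np.array_equal(scipysparse_radix_perbyte(x), np.sort(x))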
One inefficiency of an LSB (https://en.wikipedia.org/wiki/Least_significant_bit) radix sort like the one above is that the array is completely shuffled in RAM a number of times, which means the CPU cache isn't used very well. To try to mitigate this effect, one can first do a single pass of MSB radix sort, putting items in roughly the right block of RAM, before sorting each resulting group with an LSB radix sort. Here's one implementation:
def scipysparse_radix_hybrid(a, bbits=8, gbits=8):
    """
    Parameters
    ----------
    a : Array of non-negative integers to be sorted.
    bbits : Number of bits in radix for LSB sorting.
    gbits : Number of bits in radix for MSB grouping.
    """
    a = a.copy()
    bitlen = int(a.max()).bit_length()
    work = np.empty_like(a)

    # Group values by a single iteration of MSB radix sort:
    # Casting to np.int_ to get rid of Python BigInt
    ngroups = np.int_(2**gbits)
    group_offset = np.empty(ngroups + 1, int)
    shift = max(bitlen-gbits, 0)
    a, work = radix_step(ngroups, a>>shift, group_offset, a, work)
    bitlen = shift
    if not bitlen:
        return a

    # LSB radix sort each group:
    agroups = np.split(a, group_offset[1:-1])
    # Mask off high bits to not undo the grouping.
    gmask = (1 << shift) - 1
    nbatch = (bitlen-1) // bbits + 1
    radix = np.int_(2**bbits)
    _ = np.empty(radix + 1, int)
    for agi in agroups:
        if not agi.size:
            continue

        mask = (radix - 1) & gmask
        wgi = work[:agi.size]
        for shift in range(0, nbatch*bbits, bbits):
            keys = (agi & mask) >> shift
            agi, wgi = radix_step(radix, keys, _, agi, wgi)
            mask = (mask << bbits) & gmask

        if nbatch % 2:
            # Copy result back into `a`
            wgi[...] = agi

    return a
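A quick way to convince yourself that the MSB grouping pass doesn't break correctness (again my addition, assuming the functions above are defined):

x = np.random.randint(0, 10**8, 10**6)
assert np.array_equal(scipysparse_radix_hybrid(x, bbits=11, gbits=6), np.sort(x))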
Timings (with the best-performing settings for each on my system):
def numba_radix(a, batch_m_bits=8):
    a = a.copy()
    bit_len = int(a.max()).bit_length()
    nbatches = (bit_len-1)//batch_m_bits + 1
    work = np.zeros_like(a)
    bitsums = np.zeros(2**batch_m_bits + 1, int)
    srtd = radix_loop(nbatches, batch_m_bits, bitsums, a, work)
    return srtd

a = np.random.randint(0, 10**8, 10**6)

%timeit numba_radix(a, 9)
# 10 loops, best of 3: 76.1 ms per loop
%timeit np.sort(a)
# 10 loops, best of 3: 115 ms per loop
%timeit scipysparse_radix_perbyte(a)
# 10 loops, best of 3: 95.2 ms per loop
%timeit scipysparse_radix_hybrid(a, 11, 6)
# 10 loops, best of 3: 75.4 ms per loop
Numba performs very well, as expected. And with some clever application of existing C extensions it's possible to beat numpy.sort. IMO, at the level of optimization you've already reached, it's worth considering add-ons to NumPy as well, but I wouldn't really call the implementations in my answer "vectorized": the bulk of the work is done in an external, dedicated function.
Another thing that strikes me is the sensitivity to the choice of radix. For most of the settings I tried, my implementations were still slower than numpy.sort, so in practice some kind of heuristic would be needed to offer good performance across the board.
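A hypothetical starting point for such a heuristic is sketched below; the function name and thresholds are invented for illustration, not tuned:

def pick_radix_bits(a):
    # Invented rule of thumb: spend more bits per LSB pass on longer
    # keys, and only enable MSB grouping when the array is large
    # enough for cache effects to matter.
    bitlen = int(a.max()).bit_length()
    bbits = 11 if bitlen > 16 else 8
    gbits = 6 if a.size > 10**5 else 0
    return bbits, gbits

One would then call scipysparse_radix_hybrid(a, *pick_radix_bits(a)), perhaps falling back to np.sort for small arrays.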