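Based on your example, it looks like you have shuffled all of the columns simultaneously, so that there is a single vector of row indices that maps A→B. Here's a toy example: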
import numpy as np

A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]
print(repr(A))
# array([[ 7, 11,  6],
#        [ 4, 10,  8],
#        [ 9,  2,  0],
#        [ 1,  3,  5]])
print(repr(B))
# array([[ 1,  3,  5],
#        [ 4, 10,  8],
#        [ 7, 11,  6],
#        [ 9,  2,  0]])
We want to recover a set of indices, idx, such that A[idx] == B. This will be a unique mapping if and only if A and B contain no repeated rows.
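If you want to check that precondition up front, one option is to count the distinct rows. This is only a sketch: `has_unique_rows` is a made-up helper name, and it assumes a 2-D input.

```python
import numpy as np

def has_unique_rows(X):
    """Hypothetical helper: True if no row of the 2-D array X is repeated."""
    # np.unique with axis=0 collapses repeated rows into a single row
    return np.unique(X, axis=0).shape[0] == X.shape[0]

A = np.array([[7, 11, 6], [4, 10, 8], [9, 2, 0], [1, 3, 5]])
print(has_unique_rows(A))                      # True
print(has_unique_rows(np.vstack([A, A[:1]])))  # False: first row appended again
```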
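One efficient* approach is to find the indices that would lexically sort the rows of A, then find where each row of B would fall within the sorted version of A. A useful trick (https://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array/16973510#16973510) is to view A and B as 1D arrays, using an np.void dtype that treats each row as a single element: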
# size of one row in bytes; integer division keeps this an int under Python 3
rowtype = np.dtype((np.void, A.dtype.itemsize * (A.size // A.shape[0])))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a) # indices that sort the rows of A in lexical order
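To make the row-view trick more concrete, here is a small sketch of what the view produces. The array `X` is invented purely for illustration; the point is that each row becomes one opaque element whose raw bytes can be compared.

```python
import numpy as np

# Hypothetical 3x2 array with a repeated row, for illustration only
X = np.array([[3, 4], [1, 2], [3, 4]], dtype=np.int64)

# One void element per row: bytes per item times items per row
rowtype = np.dtype((np.void, X.dtype.itemsize * X.shape[1]))
x = np.ascontiguousarray(X).view(rowtype).ravel()

print(x.shape)                           # (3,)  one element per row
print(x.dtype.itemsize)                  # 16    two int64 values per row
print(x[0].tobytes() == x[2].tobytes())  # True  identical rows, identical bytes
print(x[0].tobytes() == x[1].tobytes())  # False
```

Note that the resulting sort order is an ordering of the raw bytes, not necessarily of the numeric values. That is fine here, because all we need is a consistent ordering shared by A and B.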
Now we can use np.searchsorted (http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.searchsorted.html) to perform a binary search for where each row of B would fall within the sorted version of A:
# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`
as_to_b = a.searchsorted(b, sorter=a_to_as)
The mapping from A→B can be expressed as the composite of A→As→B:
a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using:
b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
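To see why the no-repeated-rows condition matters for the inverse, here is a tiny sketch with a duplicated row. The arrays and the hand-written `a_to_b` are made up for illustration: with duplicates, a valid forward mapping can reuse an index, so it is no longer a permutation and np.argsort does not invert it.

```python
import numpy as np

# Hypothetical input in which rows 0 and 2 are identical
A = np.array([[1, 2], [3, 4], [1, 2]])
B = A.copy()

# One valid forward mapping: both copies of [1, 2] point at row 0 of A
a_to_b = np.array([0, 1, 0])
print(np.all(A[a_to_b] == B))  # True

# a_to_b is not a permutation, so argsort does not give the inverse
b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))  # False
```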
As a single function:
def find_row_mapping(A, B):
    """
    Given A and B, where B is a copy of A permuted over the first dimension, find
    a set of indices idx such that A[idx] == B.
    This is a unique mapping if and only if there are no repeated rows in A and B.

    Arguments:
        A, B: n-dimensional arrays with same shape and dtype
    Returns:
        idx: vector of indices into the rows of A
    """
    if not (A.shape == B.shape):
        raise ValueError('A and B must have the same shape')
    if not (A.dtype == B.dtype):
        raise TypeError('A and B must have the same dtype')
    # size of one row in bytes; integer division keeps this an int under Python 3
    rowtype = np.dtype((np.void, A.dtype.itemsize * (A.size // A.shape[0])))
    # view each row as a single opaque element (forces a copy if not C-contiguous)
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)                      # A -> sorted A
    as_to_b = a.searchsorted(b, sorter=a_to_as)  # sorted A -> B
    return a_to_as.take(as_to_b)
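One thing to be aware of is that searchsorted only returns insertion points, so if B contains a row that is not in A you will get either a wrong index or an out-of-bounds error rather than a clear message. Below is a sketch of a variant that verifies its result. The name `find_row_mapping_checked` is made up, and it assumes non-empty 2-D inputs.

```python
import numpy as np

def find_row_mapping_checked(A, B):
    """Hypothetical variant of find_row_mapping that validates its result.

    Assumes A and B are non-empty 2-D arrays with the same shape and dtype.
    """
    if A.shape != B.shape:
        raise ValueError('A and B must have the same shape')
    if A.dtype != B.dtype:
        raise TypeError('A and B must have the same dtype')
    rowtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)
    # searchsorted returns len(a) for a row that sorts after every row of A,
    # so clip the positions before using them as indices
    idx = a_to_as.take(np.minimum(as_to_b, len(a) - 1))
    # an insertion point is not proof of a match, so compare the raw bytes
    if a[idx].tobytes() != b.tobytes():
        raise ValueError('B contains a row that does not occur in A')
    return idx

A = np.arange(12).reshape(4, 3)
B = A[[3, 1, 0, 2]]
print(find_row_mapping_checked(A, B))  # [3 1 0 2]
```

The extra byte comparison is a single linear pass, so it should not change the overall cost much compared with the sort.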
Benchmark:
In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1 loop, best of 3: 2.76 s per loop
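*The most costly step is the quicksort over the rows, which is O(n log n) on average. I'm not sure whether it's possible to do any better than that.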
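For what it's worth, one alternative that avoids the sort is to hash each row's bytes into a dict, which needs only an expected linear number of lookups. The sketch below is untested against the benchmark above, the function name is made up, and the Python-level loops may well make it slower in practice than the vectorized sort.

```python
import numpy as np

def find_row_mapping_hashed(A, B):
    """Hypothetical dict-based alternative; assumes 2-D inputs, same shape and dtype."""
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B)
    # map each row's raw bytes to the index of its last occurrence in A
    lookup = {row.tobytes(): i for i, row in enumerate(A)}
    # a KeyError here means that some row of B does not occur in A
    return np.array([lookup[row.tobytes()] for row in B], dtype=np.intp)

A = np.arange(12).reshape(4, 3)
B = A[[3, 1, 0, 2]]
print(find_row_mapping_hashed(A, B))  # [3 1 0 2]
```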