A lot here is likely to depend on the size of the output set (which, in turn, depends on how distant the neighbors you sample are).
If it's small (no more than a few dozen items or so), your hand-rolled set implementation using std::vector and std::find will probably remain fairly competitive. Its problem is that it's an O(N²) algorithm -- each time you insert an item, you have to search all the existing items, so each insertion is linear in the number of items already in the set. Therefore, as the set grows larger, its time to insert items grows roughly quadratically.
Using std::set,
each insertion only has to do approximately log2(N) comparisons instead of N comparisons. That reduces the overall complexity from O(N²) to O(N log N). The major shortcoming is that it's (at least normally) implemented as a tree built up of individually allocated nodes. That typically reduces its locality of reference -- i.e., each item you insert will consist of the data itself plus some pointers, and traversing the tree means following pointers around. Since they're allocated individually, chances are pretty good that nodes that are (currently) adjacent in the tree won't be adjacent in memory, so you'll see a fair number of cache misses. Bottom line: while its speed grows fairly slowly as the number of items increases, the constants involved are fairly large -- for a small number of items, it'll start out fairly slow (typically quite a bit slower than your hand-rolled version).
Using vector/sort/unique combines some of the advantages of each of the preceding. Storing the items in a vector (with no extra pointers per item) generally leads to better cache usage -- items at adjacent indexes are also at adjacent memory locations, so when you insert a new item, chances are the location for the new item will already be in the cache. The major disadvantage is that if you're dealing with a really large set, this could use quite a bit more memory: where a set eliminates duplicates as you insert each item (i.e., an item is only inserted if it's different from anything already in the set), this inserts all the items, then deletes all the duplicates at the end. Given current memory availability and the number of neighbors I'd guess you're probably visiting, I doubt that's a major disadvantage in practice, but under the wrong circumstances it could lead to a serious problem -- nearly any use of virtual memory would almost certainly make it a net loss.
Looking at the last one from a complexity viewpoint, it's going to be O(N log N), sort of like the set. The difference is that with the set, it's really more like O(N log M), where N is the total number of neighbors and M is the number of unique neighbors. With the vector, it's truly O(N log N), where N is (again) the total number of neighbors. As such, if the number of duplicates is extremely large, a set could have a significant algorithmic advantage.
It's also possible to implement a set-like structure in a purely linear sequence. This retains the set's advantage of only storing unique items, along with the vector's locality-of-reference advantage. The idea is to keep most of the current set sorted, so you can search it with log(N) complexity. When you insert a new item, however, you just put it in a separate vector (or an unsorted section of the existing vector). When you do a new insertion, you also do a linear search on those unsorted items.
When that unsorted section gets too large (for some definition of "too large"), you sort those items and merge them into the main group, then start the same sequence over again. If you define "too large" in terms of "log N" (where N is the number of items in the sorted group), you can retain O(N log N) complexity for the data structure as a whole. When I've played with it, I've found the unsorted section can be larger than I'd have expected before it starts to cause a problem.