How to efficiently find overlapping intervals?

2024-01-30

I have the following toy example dataframe, df:

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977
   0.485997  0.491911
   0.503259  0.508679
   0.504687  0.510075
   0.504687  0.670075
   0.666093  0.670438
   0.765602  0.770028
   0.766884  0.771307
   0.775986  0.780398
   0.794590  0.798965

To find its overlapping subsets, I use the following code:

df = df.sort_values('f_low')
for row in df.itertuples():
    iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
    span_range = pd.Interval(row.f_low, row.f_high)
    fx = df[(iix.overlaps(span_range))].copy()

I would like to get overlapping dataframes like this:

   # iteration 1: over row.f_low=0.476201  row.f_high=0.481915 

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977

   # iteration 2: over row.f_low=0.503259  row.f_high=0.508679 
      f_low    f_high
   0.503259  0.508679 
   0.504687  0.510075
   0.504687  0.670075

   # iteration 3: over row.f_low=0.504687  row.f_high=0.670075 
      f_low    f_high
   0.666093  0.670438

etc.

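This works, but because the dataframe is very large and has many overlaps, it takes a long time to process. Also, the interval I'm testing for overlaps does not grab itself when using the pandas Interval and overlaps methods.

What this is meant to represent is a series of confidence intervals overlapping each row being iterated over.

Besides iterating over all the tuples, is there a more efficient way to extract the intervals that overlap a given interval?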

Here is an unsorted chunk of the actual dataframe:

f_low   f_high
0.504687  0.670075
0.476201  0.481915
0.765602  0.770028
0.479161  0.484977
0.766884  0.771307
0.485997  0.491911
0.666093  0.670438
0.503259  0.508679
0.775986  0.780398
0.504687  0.510075
0.794590  0.798965

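Consecutive Overlaps

Treat the "f_low" values as entry points and assign each a value of 1. Treat the "f_high" values as exit points and assign each a value of -1. If we go through all the values in ascending order and accumulate the assigned values, we are inside an overlapping interval whenever the accumulated value is greater than zero. When the accumulated value reaches zero, we know we have exited every overlapping interval.

NOTE:

This groups all consecutively overlapping intervals. If an interval does not overlap with the first interval BUT does overlap with the last one in the chain, it still counts as overlapping.

I'll provide a similar solution for the other option below this one.

Worked example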

#  1     3                     (Interval from 1 to 3)
#     2        5               (Interval from 2 to 5)
#                    7     9   (Interval from 7 to 9)

#  1  1 -1    -1     1    -1   (Entry/Exit values)
#  1  2  1     0     1     0   (Accumulated values)
#              ⇑           ⇑
# zero indicates leaving all overlaps

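This shows that once we enter the interval 1 to 3, we do not leave all overlapping intervals until we reach 5, the right side of the interval 2 to 5, as indicated by the accumulated value reaching zero.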
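The running total in the diagram can be reproduced in a few lines of NumPy. This is a minimal sketch for illustration only, separate from the generator in this answer; the names lows, highs, points, steps and running are chosen here and do not come from the original post.

```python
import numpy as np

# The three intervals from the diagram: 1-3, 2-5 and 7-9.
lows = np.array([1, 2, 7])
highs = np.array([3, 5, 9])

# Put every endpoint on one axis: +1 where an interval opens, -1 where it closes.
points = np.concatenate([lows, highs])
steps = np.concatenate([np.ones(len(lows), int), -np.ones(len(highs), int)])

# Walk the endpoints in ascending order and keep a running total.
order = np.argsort(points, kind="stable")
sorted_points = points[order]
running = steps[order].cumsum()

print(sorted_points.tolist())  # [1, 2, 3, 5, 7, 9]
print(running.tolist())        # [1, 2, 1, 0, 1, 0]
```

The zeros at 5 and 9 are the points where no interval is open any more, so a chain of overlaps ends at each of them.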

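I'll use a generator to return lists of indices of the original dataframe that have overlapping intervals.

All told, this should be O(N log N), because of the sorting involved.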

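import numpy as np
import pandas as pd

def gen_overlaps(df):
    # stable sorts keep tied values in row order, so an interval that
    # starts exactly where the previous one ends opens a new group
    df = df.sort_values('f_low', kind='stable')
    bounds = df[['f_low', 'f_high']].to_numpy()

    # positions that sort all lows and highs together
    a = bounds.ravel().argsort(kind='stable')

    # inverse permutation, used to undo the sort
    b = np.empty_like(a)
    b[a] = np.arange(len(a))

    # +1 for entering an interval (f_low),
    # -1 for exiting one (f_high)
    c = np.ones(bounds.shape, int) * [1, -1]

    # accumulate entries and exits in value order;
    # the running total is zero when no interval is open
    d = c.ravel()[a].cumsum()[b].reshape(bounds.shape)
    #             ⇑           ⇑
    # sort by value order     unsort to get back to original order

    # A low whose running total is 1 was entered from zero, so it
    # starts a new group. (Testing the high for zero instead would
    # split off an interval nested inside an earlier, longer one.)
    indices = []
    for i, starts_group in zip(df.index, d[:, 0] == 1):
        if starts_group and indices:
            yield indices
            indices = []
        indices.append(i)
    if indices:
        yield indices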

Then I'll use pd.concat to organize them and show what I mean. k is the kth group. Some groups contain only one interval.

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

         f_low    f_high
0 0   0.476201  0.481915
  1   0.479161  0.484977
1 2   0.485997  0.491911
2 3   0.503259  0.508679
  4   0.504687  0.510075
  5   0.504687  0.670075
  6   0.666093  0.670438
3 7   0.765602  0.770028
  8   0.766884  0.771307
4 9   0.775986  0.780398
5 10  0.794590  0.798965

If we only want the ones that overlap...

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
    if len(i) > 1
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
2 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
  6  0.666093  0.670438
3 7  0.765602  0.770028
  8  0.766884  0.771307
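As an alternative to the generator, a running maximum of f_high can label the same groups without a Python loop. This is a different technique from the entry/exit count above, sketched here on the assumption that df holds exactly the toy data from the question; ordered, reach, labels, labelled and overlapping are illustrative names. With the strict > used here, intervals that merely touch (an f_low equal to an earlier f_high) fall into the same group.

```python
import pandas as pd

# Toy data from the question, already in f_low order.
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})

ordered = df.sort_values("f_low", kind="stable")

# Largest f_high among the rows that come before the current one.
reach = ordered["f_high"].cummax().shift()

# A row opens a new group when it starts after everything seen so far has ended.
# The first row is compared against NaN, which is False, so it gets label 0.
labels = (ordered["f_low"] > reach).cumsum()

labelled = ordered.assign(group=labels)

# Keep only the groups that contain more than one interval.
group_size = labelled.groupby("group")["f_low"].transform("size")
overlapping = labelled[group_size > 1]

print(labels.tolist())             # [0, 0, 1, 2, 2, 2, 2, 3, 3, 4, 5]
print(overlapping.index.tolist())  # [0, 1, 3, 4, 5, 6, 7, 8]
```

The labels match the outer level of the pd.concat output above, and labelled.groupby("group") gives one sub-frame per chain of overlaps.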

Overlap with only the next interval in line

This is a simpler solution and matches the OP's desired output.

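def gen_overlaps(df):
    df = df.sort_values('f_low')

    indices = []
    cursor = None  # f_high of the interval that opened the current group
    for i, low, high in df[['f_low', 'f_high']].itertuples():
        if indices and low <= cursor:
            # still overlaps the interval that opened the group
            indices.append(i)
            continue
        # the current row breaks the group: emit it if it has
        # more than one member, then let this row open the next group
        if len(indices) > 1:
            yield indices
        indices = [i]
        cursor = high
    if len(indices) > 1:
        yield indices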
    

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
1 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
2 7  0.765602  0.770028
  8  0.766884  0.771307
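For the narrower case in the question, extracting everything that overlaps one given interval, a boolean mask over the two columns avoids rebuilding an IntervalIndex on every iteration. This is only a sketch: overlapping_rows is a hypothetical helper name, the intervals are treated as open (matching closed='neither' in the question), and each call still scans the whole frame, so running it for every row remains quadratic.

```python
import pandas as pd

# Toy data from the question, already in f_low order.
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})


def overlapping_rows(df, low, high):
    """Rows whose open interval (f_low, f_high) intersects (low, high)."""
    # Two open intervals intersect when each one starts before the other ends.
    mask = (df["f_low"] < high) & (df["f_high"] > low)
    return df[mask]


hits = overlapping_rows(df, 0.504687, 0.670075)
print(hits.index.tolist())  # [3, 4, 5, 6]
```

The result includes the query row itself when it is part of df (index 5 here); drop it by index if that is not wanted.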


python

pandas

overlap