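Consecutive Overlaps
Treat "f_low" values as entry points and assign each a value of 1. Treat "f_high" values as exit points and assign each a value of -1. If we process all values in ascending order and accumulate the assigned values, we are inside at least one interval whenever the accumulated value is greater than zero (and inside an overlap whenever it exceeds one). When the accumulated value reaches zero, we know we have exited every overlapping interval.
NOTE:
This groups all consecutively overlapping intervals. If an interval does not overlap with the first interval BUT does overlap with the last one in the chain, it still counts as part of the group.
I'll provide a similar solution for the other option below this one.
Example to try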
#   1       3                 (Interval from 1 to 3)
#       2       5             (Interval from 2 to 5)
#                   7   9     (Interval from 7 to 9)
#   1   1  -1  -1   1  -1     (Entry/Exit values)
#   1   2   1   0   1   0     (Accumulated values)
#               ⇑       ⇑
#   zero indicates leaving all overlaps
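This shows that once we enter the interval 1 to 3, we do not leave all overlapping intervals until we reach 5, the right side of the interval 2 to 5, as indicated by the accumulated value reaching zero.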
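To make the entry/exit bookkeeping concrete, here is a minimal pure-Python sketch of the same sweep on the three example intervals. The function name sweep is hypothetical, and the tie-breaking rule (entries before exits at equal values) is my assumption for this sketch, not something the approach above specifies.

```python
def sweep(intervals):
    """Return (value, running_total) pairs for a list of (low, high) intervals."""
    events = []
    for low, high in intervals:
        events.append((low, 1))    # entering an interval
        events.append((high, -1))  # exiting an interval
    # Ascending by value. Assumption: on ties, entries (+1) are processed
    # before exits (-1), so intervals that merely touch count as overlapping.
    events.sort(key=lambda e: (e[0], -e[1]))
    total = 0
    trace = []
    for value, step in events:
        total += step
        trace.append((value, total))
    return trace


trace = sweep([(1, 3), (2, 5), (7, 9)])
print(trace)
```

The running totals come out as 1, 2, 1, 0, 1, 0, with zeros at the values 5 and 9, which are the two points where the sweep has left every open interval.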
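I'll use a generator to return lists of indices from the original dataframe that have overlapping intervals.
When all is said and done, this should be O(N log N) because of the sorting involved.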
import numpy as np
import pandas as pd

def gen_overlaps(df):
    df = df.sort_values('f_low')
    # positions that sort all lows and highs together
    a = df.to_numpy().ravel().argsort()
    # inverse permutation that undoes the sort
    b = np.empty_like(a)
    b[a] = np.arange(len(a))
    # ones and negative ones to indicate
    # entering into and exiting an interval
    c = np.ones(df.shape, int) * [1, -1]
    # if we sort by all values and
    # accumulate when we enter and exit,
    # the accumulated value should be
    # zero when there are no overlaps
    d = c.ravel()[a].cumsum()[b].reshape(df.shape)
    #             ⇑           ⇑
    # sort by value order     unsort to get back to original order
    indices = []
    for i, indicator in zip(df.index, d[:, 1] == 0):
        indices.append(i)
        if indicator:
            # accumulated value is zero at this f_high: the group is closed
            yield indices
            indices = []
    if indices:
        yield indices
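The pair a and b in the generator is the usual argsort / inverse-permutation idiom. As a small illustration on made-up numbers (the array values is purely hypothetical), indexing with a sorts the data and indexing the sorted result with b restores the original order:

```python
import numpy as np

values = np.array([30, 10, 20])

a = values.argsort()          # positions that sort `values`: [1, 2, 0]
b = np.empty_like(a)
b[a] = np.arange(len(a))      # inverse permutation: [2, 0, 1]

sorted_values = values[a]     # [10, 20, 30]
restored = sorted_values[b]   # back to the original order: [30, 10, 20]
print(a, b, sorted_values, restored)
```

This is why the cumulative sum can be taken in value order and then mapped back onto the (f_low, f_high) layout of the dataframe.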
Then I'd use pd.concat to organize them and show what I mean. The key k is the k-th group. Some groups contain only one interval.
pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

         f_low    f_high
0 0   0.476201  0.481915
  1   0.479161  0.484977
1 2   0.485997  0.491911
2 3   0.503259  0.508679
  4   0.504687  0.510075
  5   0.504687  0.670075
  6   0.666093  0.670438
3 7   0.765602  0.770028
  8   0.766884  0.771307
4 9   0.775986  0.780398
5 10  0.794590  0.798965
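For readers unfamiliar with passing a dict to pd.concat: the dict keys become the outer level of a MultiIndex, which is what produces the group numbers in the leftmost column. A minimal sketch on a tiny made-up frame, where the list groups is a hypothetical stand-in for what a generator like gen_overlaps would yield:

```python
import pandas as pd

df = pd.DataFrame({'f_low': [1, 2, 7], 'f_high': [3, 5, 9]})

# Hypothetical index lists: rows 0 and 1 overlap, row 2 stands alone.
groups = [[0, 1], [2]]

# Dict keys (group numbers) become the outer index level,
# the original row labels become the inner level.
out = pd.concat({k: df.loc[i] for k, i in enumerate(groups)})
print(out)
```

Selecting out.loc[0] then returns just the rows of the first group.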
If we only want the groups that actually overlap...
pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
    if len(i) > 1
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
2 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
  6  0.666093  0.670438
3 7  0.765602  0.770028
  8  0.766884  0.771307
Overlap only with the next in line
This is a simpler solution and matches the OP's desired output.
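def gen_overlaps(df):
    df = df.sort_values('f_low')
    indices = []
    cursor = None
    # assumes df has exactly the two columns f_low and f_high
    for i, low, high in df.itertuples():
        if indices and low <= cursor:
            # starts before the group's first interval ends: same group
            indices.append(i)
        else:
            # close the current group, keeping it only if it has an overlap
            if len(indices) > 1:
                yield indices
            # the current interval starts the next group
            indices = [i]
            cursor = high
    if len(indices) > 1:
        yield indices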
pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
1 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
2 7  0.765602  0.770028
  8  0.766884  0.771307
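The difference between the two grouping rules is easiest to see on a chain such as (1, 3), (2, 5), (4, 6), where the third interval overlaps the second but not the first. The sketch below is a pure-Python illustration on tuples, not the generators above; the names group_chained and group_anchored are hypothetical, and treating touching endpoints as overlapping (low <= bound) is an assumption of this sketch.

```python
def group_chained(intervals):
    """An interval joins the group if it overlaps anything already in it."""
    groups, current, reach = [], [], None
    for low, high in sorted(intervals):
        if current and low <= reach:
            current.append((low, high))
            reach = max(reach, high)  # extend to the furthest right edge so far
        else:
            if current:
                groups.append(current)
            current, reach = [(low, high)], high
    if current:
        groups.append(current)
    return groups


def group_anchored(intervals):
    """An interval joins the group only if it overlaps the group's first interval."""
    groups, current, cursor = [], [], None
    for low, high in sorted(intervals):
        if current and low <= cursor:
            current.append((low, high))  # cursor stays at the first interval's high
        else:
            if current:
                groups.append(current)
            current, cursor = [(low, high)], high
    if current:
        groups.append(current)
    return groups


data = [(1, 3), (2, 5), (4, 6)]
chained = group_chained(data)
anchored = group_anchored(data)
print(chained)
print(anchored)
```

The chained rule puts all three intervals in one group, while the anchored rule keeps (1, 3) and (2, 5) together and leaves (4, 6) on its own, which mirrors why index 6 appears in the first result table but not in the last one.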