We can try broadcasting (https://numpy.org/doc/stable/user/basics.broadcasting.html):
import pandas as pd

df = pd.DataFrame([
    [6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
    [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])
# Drop duplicate rows first: the check below keeps a row only when its
# match count is exactly 1 (the self match), so every row must be unique
df = df.drop_duplicates()
# Broadcasted comparison (explained step by step below)
cmp = (df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
# Filter using the results from the comparison
df = df[cmp]
df:
   0  1  2  3  4
0  6  5  4  3  8
2  1  1  3  9  5
4  2  0  0  4  0
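Intuition:
Broadcast the comparison operation across the DataFrame (using the DataFrame as it is after drop_duplicates and before filtering):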
(df.values[:, None] <= df.values)
[[[ True  True  True  True  True]
  [ True  True  True  True False]
  [False False False  True False]
  [False False False  True False]
  [False False False  True False]]   # [6 5 4 3 8] <= each row of df

 [[ True  True  True  True  True]
  [ True  True  True  True  True]
  [False False False  True False]
  [False False False  True False]
  [False False False  True False]]   # [6 5 4 3 6] <= each row of df

 [[ True  True  True False  True]
  [ True  True  True False  True]
  [ True  True  True  True  True]
  [False  True False False False]
  [ True False False False False]]   # [1 1 3 9 5] <= each row of df

 [[ True  True  True False  True]
  [ True  True  True False  True]
  [ True  True  True  True  True]
  [ True  True  True  True  True]
  [ True False False False False]]   # [0 1 2 7 4] <= each row of df

 [[ True  True  True False  True]
  [ True  True  True False  True]
  [False  True  True  True  True]
  [False  True  True  True  True]
  [ True  True  True  True  True]]]  # [2 0 0 4 0] <= each row of df
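To make the shapes behind this output concrete, here is a minimal sketch using the same sample data. The names a, left, and out are illustrative only and do not appear in the answer's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
    [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])

a = df.values        # shape (5, 5): rows x columns
left = a[:, None]    # shape (5, 1, 5): new middle axis of length 1
out = left <= a      # broadcasts to shape (5, 5, 5)

# out[i, j, k] answers: is a[i, k] <= a[j, k]?
i, j = 1, 0
same = np.array_equal(out[i, j], a[i] <= a[j])
print(left.shape, out.shape, same)  # (5, 1, 5) (5, 5, 5) True
```

Each 2D block printed above is out[i]: row i compared against every row of the DataFrame.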
Then we can check all (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.all.html) on axis=2:
(df.values[:, None] <= df.values).all(axis=2)
[[ True False False False False]   # Rows ge [6 5 4 3 8]
 [ True  True False False False]   # Rows ge [6 5 4 3 6]
 [False False  True False False]   # Rows ge [1 1 3 9 5]
 [False False  True  True False]   # Rows ge [0 1 2 7 4]
 [False False False False  True]]  # Rows ge [2 0 0 4 0]
Then we can use sum (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.sum.html) to count, for each row, how many rows it is less than or equal to:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1)
[1 2 1 2 1]
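To see which matches push a count above 1, the self matches on the diagonal can be masked out. This is only a sketch for inspection, not part of the solution itself; le, off_diag, and pairs are names made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
    [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])

a = df.values
# le[i, j] is True when row i <= row j in every column
le = (a[:, None] <= a).all(axis=2)

# Every row is <= itself, so drop the diagonal to leave only real matches
off_diag = le & ~np.eye(len(a), dtype=bool)

# (i, j) means: row i is less than or equal to row j
pairs = [(int(df.index[i]), int(df.index[j])) for i, j in np.argwhere(off_diag)]
print(pairs)  # [(1, 0), (3, 2)]
```

Rows 1 and 3 are each less than or equal to another row, which is why their counts are 2.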
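The rows to keep are those that are less than or equal to exactly 1 row (the self match only). Because we drop_duplicates (https://pandas.pydata.org/docs/reference/api/pandas.Series.drop_duplicates.html), there are no duplicates in the DataFrame, so for these rows the only True value is the self match: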
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
[ True False True False True]
This then becomes the filter for the DataFrame:
df = df[[True, False, True, False, True]]
df:
   0  1  2  3  4
0  6  5  4  3  8
2  1  1  3  9  5
4  2  0  0  4  0
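For reuse, the steps can be wrapped in a small function. The sketch below assumes an all-numeric DataFrame. The name drop_dominated_rows is hypothetical, and to_numpy() is used in place of .values, which should behave the same here.

```python
import pandas as pd

def drop_dominated_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep rows that are not <= any other row in every column."""
    # Duplicates must go first, otherwise identical rows would match each other
    unique = frame.drop_duplicates()
    a = unique.to_numpy()
    # counts[i]: number of rows j for which a[i] <= a[j] in all columns
    counts = (a[:, None] <= a).all(axis=2).sum(axis=1)
    # A count of 1 means the row only matched itself
    return unique[counts == 1]

df = pd.DataFrame([
    [6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
    [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])
result = drop_dominated_rows(df)
print(result.index.tolist())  # [0, 2, 4]
```

Note that the intermediate boolean array holds n * n * m values for n rows and m columns, so this approach fits small to medium DataFrames best.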