I think you first need all combinations of the column names:
import pandas as pd

df = pd.DataFrame({'A':[5,3,6,9,2,4],
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   })
print (df)
   A  B  C  D
0  5  4  7  1
1  3  5  8  3
2  6  4  9  5
3  9  5  4  7
4  2  5  2  1
5  4  4  3  0
from itertools import combinations
a = df.columns
comb = [j for i in range(len(a), 0, -1) for j in combinations(a,i)]
print (comb)
[('A', 'B', 'C', 'D'),
('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('B', 'C', 'D'),
('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D'),
('A',), ('B',), ('C',), ('D',)]
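As a quick check on the list above, here is a minimal sketch that wraps the same comprehension in a helper. The name `all_column_subsets` is only illustrative and is not part of the original answer. It shows that n columns yield 2**n - 1 non-empty subsets, ordered from the widest combination down to single columns.

```python
from itertools import combinations

def all_column_subsets(columns):
    """Return every non-empty subset of `columns`, widest subsets first."""
    cols = list(columns)
    # r runs from len(cols) down to 1, so the full set comes first
    return [c for r in range(len(cols), 0, -1) for c in combinations(cols, r)]

subsets = all_column_subsets(["A", "B", "C", "D"])
print(len(subsets))   # 15, i.e. 2**4 - 1
print(subsets[0])     # ('A', 'B', 'C', 'D')
print(subsets[-1])    # ('D',)
```

The ordering matters later: a lower position in this list means a stricter match, because more columns have to agree.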
a = pd.concat([df.loc[:, list(x)].sum(axis=1) for x in comb], axis=1)
print (a)
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
0  17  16  10  13  12   9  12   6  11   5   8   5   4   7   1
1  19  16  11  14  16   8  11   6  13   8  11   3   5   8   3
2  24  19  15  20  18  10  15  11  13   9  14   6   4   9   5
3  25  18  21  20  16  14  13  16   9  12  11   9   5   4   7
4  10   9   8   5   8   7   4   3   7   6   3   2   5   2   1
5  11  11   8   7   7   8   7   4   7   4   3   4   4   3   0
Then flag the duplicated rows for every combination with duplicated http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html, join the results with concat http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html, and get the position of the first True in each row with numpy.argmax https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html:
print (pd.concat([df.duplicated(x, keep=False) for x in comb], axis=1))
       0      1      2      3      4      5      6      7      8      9  \
0   True   True   True   True   True   True   True   True   True   True
1   True   True   True   True   True   True   True   True   True   True
2  False  False  False  False  False   True  False  False  False  False
3  False  False  False  False  False   True  False  False  False  False
4  False  False  False  False  False  False  False  False  False  False

      10     11     12     13     14
0   True   True   True   True   True
1   True   True   True   True   True
2  False   True   True  False  False
3  False   True   True  False  False
4  False   True  False  False   True
a = pd.concat([df.duplicated(x, keep=False) for x in comb], axis=1).values.argmax(axis=1)
print (a)
[ 0 0 5 5 11]
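One thing worth knowing about this step: on a boolean array, `argmax` returns the position of the first True in each row, but it also returns 0 for a row that contains no True at all. The sketch below uses a small made-up mask (not the data above) to show the behaviour and one possible way to mark such rows; the `-1` label is an assumption chosen purely for illustration.

```python
import numpy as np

# Hypothetical mask: rows are records, columns are combinations
mask = np.array([
    [True,  True,  True],    # duplicated under every combination
    [False, True,  True],    # first match at position 1
    [False, False, False],   # never duplicated
])

first_true = mask.argmax(axis=1)              # [0, 1, 0]; last row is ambiguous
has_match = mask.any(axis=1)                  # [True, True, False]
labels = np.where(has_match, first_true, -1)  # [0, 1, -1]
print(labels)
```

If every row is duplicated under at least one combination, as in the output above, plain `argmax` is enough.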
Finally, pass this array to groupby:
df = df.groupby(a).sum()
print (df)
    E  F   G
0   4  6  15
5   2  3   1
11  1  8   1
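To see the three steps working together, here is a minimal end-to-end sketch on a small made-up frame. The function name `first_duplicate_level`, the column names, and the values are all assumptions for illustration, not the asker's data.

```python
from itertools import combinations

import pandas as pd

def first_duplicate_level(df, key_cols):
    """For each row, return the position of the widest column combination
    under which that row is duplicated, plus the list of combinations."""
    combos = [c for r in range(len(key_cols), 0, -1)
              for c in combinations(key_cols, r)]
    # One boolean column per combination; keep=False marks every duplicate
    mask = pd.concat([df.duplicated(list(c), keep=False) for c in combos],
                     axis=1)
    return mask.to_numpy().argmax(axis=1), combos

# Hypothetical data: A and B are keys, val is the column to aggregate
toy = pd.DataFrame({
    "A":   [1, 1, 2, 3],
    "B":   [7, 7, 8, 8],
    "val": [10, 20, 30, 40],
})

labels, combos = first_duplicate_level(toy, ["A", "B"])
print(labels.tolist())   # [0, 0, 2, 2]
print(combos[2])         # ('B',)

totals = toy.groupby(labels)["val"].sum()
print(totals.tolist())   # [30, 70]
```

Note that grouping by the label alone puts all rows that matched at the same level into one group, even if their key values differ. If that is not what you want, group by the label together with the relevant key columns.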