As part of feature engineering, I want to use per-group value counts (after a groupby) as features for a model. This is what I have tried:
>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
  col1  col2  col3
0    a  val1  val3
1    b  val2  val4
2    a  val2  val3
3    c  val1  val4
4    a  val2  val3
5    b  val2  val4
>>> test = df.groupby('col1').agg(list)
>>> test
                    col2                col3
col1
a     [val1, val2, val2]  [val3, val3, val3]
b           [val2, val2]        [val4, val4]
c                 [val1]              [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
                        col2         col3
col1
a     {'val1': 1, 'val2': 2}  {'val3': 3}
b                {'val2': 2}  {'val4': 2}
c                {'val1': 1}  {'val4': 1}
I can then expand the dictionaries into separate columns, so the final output is:
>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
>>> final
   val1  val2  val3  val4
a   1.0   2.0   3.0   NaN
b   NaN   2.0   NaN   2.0
c   1.0   NaN   NaN   1.0
I feel there must be a simpler solution. Any help is appreciated.
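
For reference, one possible shorter route is sketched below. It is not necessarily the best approach: it reshapes the frame to long form with `DataFrame.melt` and counts with `pd.crosstab`. The names `long_df` and `counts` are only illustrative. Note two differences from the output above: absent combinations come out as 0 rather than NaN, and if the same value string appeared in both `col2` and `col3`, their counts would be merged into a single column.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'c', 'a', 'b'],
    'col2': ['val1', 'val2', 'val2', 'val1', 'val2', 'val2'],
    'col3': ['val3', 'val4', 'val3', 'val4', 'val3', 'val4'],
})

# Long form: one row per (col1, original column, value) triple.
long_df = df.melt(id_vars='col1', value_name='value')

# Count how often each value occurs within each col1 group.
counts = pd.crosstab(long_df['col1'], long_df['value'])
print(counts)
```

If NaN is preferred over 0 for absent values, something like `counts.where(counts > 0)` should restore that, at the cost of turning the counts into floats.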