Use groupby/transform to generate a column with the same length as the original DataFrame. This lets you avoid calling pd.merge.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)
yields
    Company  Payment Speciality  percent
0  AcmeCorp     50.0    Roofing     0.50
1  AcmeCorp     50.0  Grounding     0.50
2   LolCorp    106.0    Roofing     0.53
3   LolCorp     94.0  Grounding     0.47
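As a rough illustration of why no merge is needed, the sketch below contrasts a plain aggregation with transform on the same example data. The variable names per_group and per_row are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# A plain aggregation collapses each group to a single row keyed by Company,
# so it would have to be merged back onto df before dividing.
per_group = df.groupby('Company')['Payment'].sum()

# transform broadcasts each group's sum back onto that group's rows,
# so the result shares df's index and can be used in arithmetic directly.
per_row = df.groupby('Company')['Payment'].transform('sum')

print(len(per_group), len(per_row))  # 2 4
print(per_row.tolist())              # [100.0, 100.0, 200.0, 200.0]
```

Because per_row is already aligned with df, df['Payment'] / per_row gives the per-row share with no join step.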
Although
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
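can be simplified to a one-liner,
df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
because built-in operations such as .transform('sum') are faster than transforms that call a custom function (e.g. .transform(lambda x: x/x.sum())), the two-line version is faster (particularly for large DataFrames).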
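Of course, the two-line version can also be written as
df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')
with no loss of speed and one fewer named variable, but it is perhaps a little harder to read.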
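If you want to convince yourself that the variants differ only in speed and not in result, a quick check along these lines should do. The names two_line, one_liner and inline are chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# Two-line version: built-in 'sum' transform, then a vectorized division.
total = df.groupby('Company')['Payment'].transform('sum')
two_line = df['Payment'] / total

# One-liner: a custom function is called once per group.
one_liner = df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

# Two-line version written inline, without the intermediate variable.
inline = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

# All three should hold the same fractions, row for row.
print(np.allclose(two_line, one_liner) and np.allclose(two_line, inline))
```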
Here is a benchmark on a DataFrame with 100K rows:
In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop
In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop
In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop
Here is the setup used to run the benchmark.
import numpy as np
import pandas as pd

N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
                   'Payment': np.random.randint(10, size=N),
                   'Speciality': np.random.choice(list('XYZ'), size=N)})

def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df

def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df

def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df
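The %timeit lines above need IPython. Outside IPython, something like the following sketch with the standard timeit module should give comparable numbers. It makes a few assumptions of its own: best_time is a hypothetical helper, the data comes from a seeded NumPy generator, and the two functions return the computed Series instead of assigning a column, so repeated calls do not mutate df. Absolute timings will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 10**5
rng = np.random.default_rng(0)  # seeded only so runs are repeatable
df = pd.DataFrame({'Company': rng.choice(list('ABCD'), size=N),
                   'Payment': rng.integers(1, 10, size=N)})

def using_transform(df):
    # Built-in aggregation broadcast to every row, then one vectorized division.
    total = df.groupby('Company')['Payment'].transform('sum')
    return df['Payment'] / total

def using_one_liner(df):
    # Custom function: pandas calls the lambda once per group.
    return df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

def best_time(func, repeat=3, number=10):
    """Hypothetical helper: best average time per call, in seconds."""
    runs = timeit.repeat(lambda: func(df), repeat=repeat, number=number)
    return min(runs) / number

for func in (using_transform, using_one_liner):
    print(f'{func.__name__}: {best_time(func) * 1e3:.1f} ms per call')
```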