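I have a dataset in which each observation has a weight, and I want to prepare weighted summaries using groupby,
but I am rusty on how best to do this. I think it calls for a custom aggregation function. My problem is how to properly handle data that is per group rather than per item. Perhaps that means it is best done in steps rather than in one go.
In pseudocode, I am looking for: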
# first, calculate the weighted value
for each row:
    weighted_jobs = weight * jobs
# then, for each city, sum these weighted values and divide by the count (the sum of weights)
for each city:
    sum(weighted_jobs) / sum(weight)
I am not sure how to get the "for each city" part into a custom aggregation function and access the group-level summaries.
Mock data:
import pandas as pd
import numpy as np
np.random.seed(43)
## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
weight = np.random.randint(low=5,high=40,size=N)
jobs = np.random.randint(low=1,high=20,size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})
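
For reference, here is a rough sketch of the two routes I have in mind: a stepwise version that mirrors the pseudocode, and a one-shot version that hands each city's rows to a custom function. The column name weighted_jobs and the helper weighted_mean are placeholders of my own, not anything pandas provides, and I am assuming np.average with its weights argument is an acceptable way to compute the ratio. The mock data is rebuilt at the top only so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the mock data so this sketch is self-contained
np.random.seed(43)
N = 100
industry = ['utilities', 'sales', 'real estate', 'finance']
city = ['sf', 'san mateo', 'oakland']
weight = np.random.randint(low=5, high=40, size=N)
jobs = np.random.randint(low=1, high=20, size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})

# Stepwise route, step 1: per-row weighted value
# (weighted_jobs is a placeholder column name)
df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']

# Stepwise route, step 2: per-city sums, then the ratio of the two sums
sums = df_city.groupby('city')[['weighted_jobs', 'weight']].sum()
stepwise = sums['weighted_jobs'] / sums['weight']

# One-shot route: a custom function that receives each city's rows
# (weighted_mean is a hypothetical helper, not a pandas function)
def weighted_mean(group):
    return np.average(group['jobs'], weights=group['weight'])

one_shot = df_city.groupby('city')[['jobs', 'weight']].apply(weighted_mean)

print(stepwise)
print(one_shot)
```

Both routes should give the same per-city numbers. The stepwise one seems easier to extend to several value columns, while the apply version keeps everything inside a single groupby call.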