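I want to compute a cumulative, week-by-week count of the unique values in a column of a pandas DataFrame. For example, suppose I have data like this: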
import pandas as pd
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
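What I want is, for each user, a running count of the unique module_id values seen up to and including each week, i.e. something like this: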
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
Doing this as a loop is straightforward, for example:
running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
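But my real DataFrame is enormous, so I would like a vectorized algorithm rather than a loop.
There is a similar-sounding question here https://stackoverflow.com/q/35759120/575530, but judging by the accepted answer (here https://stackoverflow.com/a/35759315/575530), the original poster there did not want uniqueness to accumulate across dates, as I do.
How would I do this in a vectorized way in pandas?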
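For illustration, here is one possible vectorized approach. It is a minimal sketch, not a definitive answer. The idea is to flag the first time each (user_id, module_id) pair appears, count those first appearances per user and week, and then take a cumulative sum within each user. The names `is_new`, `new_per_week`, and `result` are made up for this example, and the sketch assumes that sorting by user_id and week is acceptable for the real data.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2, 2],
    'week':      [1, 1, 2, 1, 2, 2],
    'module_id': ['A', 'B', 'A', 'A', 'B', 'C'],
})

# Sort so that the first occurrence of each module falls in its earliest week.
ordered = df.sort_values(['user_id', 'week'])

# True only for the first time a user encounters a given module.
ordered = ordered.assign(
    is_new=~ordered.duplicated(subset=['user_id', 'module_id'])
)

# Number of newly seen modules per (user, week); a week with no new
# modules still appears here, with a count of 0.
new_per_week = ordered.groupby(['user_id', 'week'])['is_new'].sum().astype(int)

# Running total of newly seen modules within each user.
result = (
    new_per_week.groupby(level='user_id').cumsum()
    .rename('cumulative_module_count')
    .reset_index()
)

print(result)
```

On the sample data this should give the same counts as the loop above. A week in which a user has no rows at all will not appear in `result`; if such weeks need to be filled in, the grouped series would have to be reindexed first.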