I'm trying to get one row per student, with columns that count their activities by progress level.
The data looks like this:
STUDENT_ID STUDENT_ACTIVITY_SESSION_ID NODE_NAME ACTIVITY_NAME prog_level
FredID gobbledeegook1 Node1 MyActivity1 pass
FredID gobbledeegook2 Node1 MyActivity1 pass
FredID gobbledeegook3 Node2 MyActivity2 pass
JaniceID gobbledeegook4 Node3 MyActivity3 stay
JaniceID gobbledeegook5 Node3 MyActivity3 stay
JaniceID gobbledeegook5 Node3 MyActivity3 fail
Here is what I want:
STUDENT_ID  attempts_pass  attempts_fail  attempts_stay
FredID      3
JaniceID                   1              2
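- I tried looping through the groups so that the column names are generated automatically. I want each row to be one STUDENT_ID, with each count in its own column.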
def std_attempts_by_prog_level(df):
    dict_fields = {}
    df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
    for name, group in df_by_prog_level:
        x = group.count()
        dict_fields["attempts_" + name] = x
    return pd.Series(dict_fields)

df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()
result:
STUDENT_ID level_1 0
0 Fred attempts_cancel 104
1 Fred attempts_fail 96
2 Fred attempts_in_progress 30
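...so this still needs pivoting and further cleanup, so I tried to get it with the pivot method instead.
- The pivot method with manually named fields: the resulting MultiIndex doesn't let me easily merge the result back with my other student metrics.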
df_temp = (
    df.groupby(['STUDENT_ID', 'prog_level'], as_index=False)['STUDENT_ACTIVITY_SESSION_ID']
    .count()
    .pivot(index='STUDENT_ID', columns='prog_level')
    .rename({'cancel': 'attempts_cancel', 'fail': 'attempts_fail',
             'in_progress': 'attempts_in_progress', 'pass': 'attempts_pass'}, axis=1)
)
print(df_temp.columns)
result:
MultiIndex([('STUDENT_ACTIVITY_SESSION_ID', 'attempts_cancel'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_fail'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_pass')],
names=[None, 'prog_level'])
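
For reference, here is a minimal sketch of one way the flat, mergeable table might be produced, using `pd.crosstab` plus `add_prefix` in place of the groupby/pivot chain. This is a swapped-in technique, not the code above. It assumes that counting rows per (STUDENT_ID, prog_level) pair is equivalent to counting non-null STUDENT_ACTIVITY_SESSION_ID values, and it rebuilds only the sample rows shown earlier. The name `attempts` is just an illustrative variable.

```python
import pandas as pd

# Sample rows copied from the data shown above (only the columns needed here).
df = pd.DataFrame({
    "STUDENT_ID": ["FredID", "FredID", "FredID", "JaniceID", "JaniceID", "JaniceID"],
    "STUDENT_ACTIVITY_SESSION_ID": [
        "gobbledeegook1", "gobbledeegook2", "gobbledeegook3",
        "gobbledeegook4", "gobbledeegook5", "gobbledeegook5",
    ],
    "prog_level": ["pass", "pass", "pass", "stay", "stay", "fail"],
})

# crosstab counts rows for every (STUDENT_ID, prog_level) combination and
# returns ordinary single-level columns, one per prog_level value.
attempts = (
    pd.crosstab(df["STUDENT_ID"], df["prog_level"])
    .add_prefix("attempts_")      # pass -> attempts_pass, etc., with no manual mapping
    .rename_axis(None, axis=1)    # drop the leftover 'prog_level' label on the columns
    .reset_index()                # turn STUDENT_ID back into a regular column
)

print(attempts)
```

Combinations that never occur come out as 0 rather than blank. If the groupby/pivot chain is preferred, passing `values='STUDENT_ACTIVITY_SESSION_ID'` to `pivot` should also give single-level columns, since the MultiIndex appears when `values` is left out.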