Pandas group-by: proportion of a cumulative sum, starting from 0

2024-04-16

I have the following pandas DataFrame (without the last 2 columns):

   name        day   show-in-appointment    previous-missed-appointments   proportion-previous-missed
0  Jack   2020/01/01   show                              0                      0
1  Jack   2020/01/02   no-show                           0                      0
2  Jill   2020/01/02   no-show                           0                      0
3  Jack   2020/01/03   show                              1                      0.5
4  Jill   2020/01/03   show                              1                      1
5  Jill   2020/01/04   no-show                           1                      0.5
6  Jack   2020/01/04   show                              1                      0.33
7  Jill   2020/01/05   show                              2                      0.66
8  jack   2020/01/06   no-show                           1                      0.25
9  jack   2020/01/07   show                              2                      0.4   (2 no-shows / 5 previous appointments)

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.asarray([
        ['Jack', 'Jack', 'Jill', 'Jack', 'Jill', 'Jill', 'Jack', 'Jill', 'jack', 'jack'],
        [
            '2020/01/01',
            '2020/01/02',
            '2020/01/02',
            '2020/01/03',
            '2020/01/03',
            '2020/01/04',
            '2020/01/04',
            '2020/01/05',
            '2020/01/06',
            '2020/01/07',
        ],
        ['show', 'no-show', 'no-show', 'show', 'show', 'no-show', 'show', 'show', 'no-show', 'show'],
    ]).T,
    columns=['name', 'day', 'show-in-appointment'],
)

The previous-missed-appointments column was created with the following code:

df.name = df.name.str.capitalize()
df['order'] = df.index
df.day = pd.to_datetime(df.day)
df['noshow'] = df['show-in-appointment'].map({'show': 0, 'no-show': 1})
df = df.sort_values(by=['name', 'day'])
df['previous-missed-appointments'] = df.groupby('name').noshow.cumsum()
df.loc[df.noshow == 1, 'previous-missed-appointments'] -= 1
df = df.sort_values(by='order')
df = df.drop(columns=['noshow', 'order'])

The question: how can I create the last column?


You can use cumsum https://pandas.pydata.org/docs/reference/api/pandas.Series.cumsum.html and shift https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html in groupby.apply https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html for the first column, then divide by groupby.cumcount https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumcount.html for the second column:

# ensure dates are sorted
df = df.sort_values(by='day', key=lambda s: pd.to_datetime(s, dayfirst=False))

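# True where the appointment was missed
m = df['show-in-appointment'].eq('no-show')

# group case-insensitively so that 'Jack' and 'jack' are the same person
g = m.groupby(df['name'].str.casefold(), group_keys=False)

# running count of misses, shifted by one row so the current row is excluded
df['previous-missed-appointments'] = (
    g.apply(lambda x: x.cumsum().shift(fill_value=0))
)

# divide by the number of earlier appointments; 0/0 on the first row gives NaN -> 0
df['proportion-previous-missed'] = (
    df['previous-missed-appointments'].div(g.cumcount()).fillna(0)
)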

print(df)

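Note: it is important that the input is first sorted by date (or by name and date), because cumsum, shift and cumcount all work in row order within each group.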
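As a possible alternative, the same two columns can be sketched without groupby.apply: the number of earlier misses is the per-group running total minus the current row's own value. This is not the code from the answer, only a self-contained variant under a few assumptions: the variable names (missed, key, previous_missed, previous_count) are made up for illustration, and a stable sort (kind='mergesort') is used so that rows sharing a date keep their original order.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jack', 'Jack', 'Jill', 'Jack', 'Jill', 'Jill', 'Jack', 'Jill', 'jack', 'jack'],
    'day': ['2020/01/01', '2020/01/02', '2020/01/02', '2020/01/03', '2020/01/03',
            '2020/01/04', '2020/01/04', '2020/01/05', '2020/01/06', '2020/01/07'],
    'show-in-appointment': ['show', 'no-show', 'no-show', 'show', 'show',
                            'no-show', 'show', 'show', 'no-show', 'show'],
})

# sort chronologically; mergesort is stable, so equal dates keep their order
df = df.sort_values(by='day', key=pd.to_datetime, kind='mergesort')

# 1 for a missed appointment, 0 otherwise
missed = df['show-in-appointment'].eq('no-show').astype(int)
# treat 'Jack' and 'jack' as the same person
key = df['name'].str.casefold()

# misses strictly before the current row = running total minus the current row
previous_missed = missed.groupby(key).cumsum() - missed
# number of earlier appointments for the same person (0 for the first one)
previous_count = missed.groupby(key).cumcount()

df['previous-missed-appointments'] = previous_missed
# 0/0 on each person's first row yields NaN, which is replaced by 0
df['proportion-previous-missed'] = previous_missed.div(previous_count).fillna(0)

print(df)
```

Subtracting the current row instead of shifting keeps everything as plain vectorized groupby operations, which may matter on larger frames.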

Output:

   name         day show-in-appointment  previous-missed-appointments  proportion-previous-missed
0  Jack  2020/01/01                show                             0                    0.000000
1  Jack  2020/01/02             no-show                             0                    0.000000
2  Jill  2020/01/02             no-show                             0                    0.000000
3  Jack  2020/01/03                show                             1                    0.500000
4  Jill  2020/01/03                show                             1                    1.000000
5  Jill  2020/01/04             no-show                             1                    0.500000
6  Jack  2020/01/04                show                             1                    0.333333
7  Jill  2020/01/05                show                             2                    0.666667
8  jack  2020/01/06             no-show                             1                    0.250000
9  jack  2020/01/07                show                             2                    0.400000

Intermediates (names grouped together and shorter column names used for clarity):

   name         day     show  previous-missed computation  proportion
0  Jack  2020/01/01     show                0         0/0        0.00
1  Jack  2020/01/02  no-show                0         0/1        0.00
3  Jack  2020/01/03     show                1         1/2        0.50
6  Jack  2020/01/04     show                1         1/3        0.33
8  Jack  2020/01/06  no-show                1         1/4        0.25
9  Jack  2020/01/07     show                2         2/5        0.40
2  Jill  2020/01/02  no-show                0         0/0        0.00
4  Jill  2020/01/03     show                1         1/1        1.00
5  Jill  2020/01/04  no-show                1         1/2        0.50
7  Jill  2020/01/05     show                2         2/3        0.67
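
To check the computation column by hand, the steps can be walked through for a single person. The snippet below is only an illustrative sketch using Jill's four rows; the frame jill and the helper names numerator and denominator are assumptions introduced here, not part of the answer.

```python
import pandas as pd

# Jill's appointments only, already in chronological order
jill = pd.DataFrame({
    'day': ['2020/01/02', '2020/01/03', '2020/01/04', '2020/01/05'],
    'show': ['no-show', 'show', 'no-show', 'show'],
})

missed = jill['show'].eq('no-show')

# misses before the current row: running total shifted down by one row
numerator = missed.cumsum().shift(fill_value=0)
# appointments before the current row: 0, 1, 2, ...
denominator = pd.Series(range(len(jill)), index=jill.index)

jill['computation'] = numerator.astype(str) + '/' + denominator.astype(str)
# the first row is 0/0 (NaN), which is reported as 0
jill['proportion'] = numerator.div(denominator).fillna(0).round(2)

print(jill)
```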