尝试这个:
import re
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Output:
ID Status Date
0 1 F 2017-06-22
1 1 M 2017-07-22
2 1 P 2017-10-22
3 1 F 2018-06-22
4 1 P 2018-08-22
5 1 F 2018-10-22
6 1 F 2019-03-22
7 2 M 2017-06-29
8 2 F 2017-09-29
9 2 F 2018-01-29
10 2 M 2018-03-29
11 2 P 2018-08-29
12 2 M 2018-10-29
13 2 F 2018-12-29
14 3 M 2017-03-20
15 3 F 2018-06-20
16 3 P 2018-08-20
17 3 M 2018-10-20
18 3 F 2018-11-20
19 3 P 2018-12-20
20 3 F 2019-03-20
22 4 M 2017-08-10
23 4 F 2018-06-10
24 4 P 2018-08-10
25 4 F 2018-12-10
26 4 M 2019-01-10
27 4 F 2019-06-10
31 7 M 2017-08-10
32 7 F 2018-04-10
33 7 P 2018-08-10
34 7 F 2018-11-10
33 7 P 2019-08-10
34 7 F 2019-10-10
我的第一个技巧是使用正则表达式来查找较长字符串中子字符串的位置。
使用join
,我构建一个字符串并在该字符串中查找模式。通过将较长的图案放在前面来定义图案。
pattern = "FPFPF|FPF"
def f(x):
m = re.search(pattern, ''.join(x['Status']))
return x[m.start():m.end()] if m else None
df1 = df.groupby('ID', group_keys=False).apply(f)
print(df1)
Output:
ID Status Date
3 1 F 2018-06-22
4 1 P 2018-08-22
5 1 F 2018-10-22
18 3 F 2018-11-20
19 3 P 2018-12-20
20 3 F 2019-03-20
23 4 F 2018-06-10
24 4 P 2018-08-10
25 4 F 2018-12-10
32 7 F 2018-04-10
33 7 P 2018-08-10
34 7 F 2018-11-10
33 7 P 2019-08-10
34 7 F 2019-10-10
计算持续时间
df1['Duration'] = df1.groupby('ID')['Date'].diff().dt.days
print(df1)
Output:
ID Status Date Duration
3 1 F 2018-06-22 NaN
4 1 P 2018-08-22 61.0
5 1 F 2018-10-22 61.0
18 3 F 2018-11-20 NaN
19 3 P 2018-12-20 30.0
20 3 F 2019-03-20 90.0
23 4 F 2018-06-10 NaN
24 4 P 2018-08-10 61.0
25 4 F 2018-12-10 122.0
32 7 F 2018-04-10 NaN
33 7 P 2018-08-10 122.0
34 7 F 2018-11-10 92.0
33 7 P 2019-08-10 273.0
34 7 F 2019-10-10 61.0
使用最新的 pandas 0.25 进行聚合并进行聚合重新标记:
df_out = df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
.agg(F_P_Duration = lambda x: x.iloc[0],
F_F_Duration = 'sum').dropna()
print(df_out)
Output:
F_P_Duration F_F_Duration
ID Status
1 1 61.0 122.0
3 2 30.0 120.0
4 3 61.0 183.0
7 4 122.0 214.0
5 273.0 334.0
Pandas 0.25 之前的更新...
df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
.agg(['first', 'sum', 'last']).dropna()\
.rename(columns={'first':'F_P_Duration',
'sum':'F_F_Duration',
'last':'P_F_Duration'})
Output:
F_P_Duration F_F_Duration P_F_Duration
ID Status
1 1 61.0 122.0 61.0
3 2 30.0 120.0 90.0
4 3 61.0 183.0 122.0
7 4 122.0 214.0 92.0
5 273.0 334.0 61.0