这可能不是获得最终结果的最佳方法,但只是想在这里展示这个想法。
- 首先,创建 DataFrame 并将时间戳转换为整数
from datetime import datetime
import pytz
from pytz import timezone
# Create DataFrame
START_DATE = datetime(2019,8,15,20,33,0)
test_df = pd.DataFrame({
'school_id': ['remote','remote','remote','remote','onsite','onsite','onsite','onsite','remote','remote'],
'class_id': ['green', 'green', 'red', 'red', 'green', 'green', 'green', 'green', 'red', 'green'],
'user_id': [15,15,16,16,15,17,17,17,16,17],
'status': [0,1,1,1,0,1,0,1,1,0],
'start': pd.date_range(start=START_DATE, periods=10, freq='2min')
})
# Convert TimeStamp to Integers
df = spark.createDataFrame(test_df)
print(df.dtypes)
df = df.withColumn('start', F.col('start').cast("bigint"))
df.show()
这输出:
+---------+--------+-------+------+----------+
|school_id|class_id|user_id|status| start|
+---------+--------+-------+------+----------+
| remote| green| 15| 0|1565915580|
| remote| green| 15| 1|1565915700|
| remote| red| 16| 1|1565915820|
| remote| red| 16| 1|1565915940|
| onsite| green| 15| 0|1565916060|
| onsite| green| 17| 1|1565916180|
| onsite| green| 17| 0|1565916300|
| onsite| green| 17| 1|1565916420|
| remote| red| 16| 1|1565916540|
| remote| green| 17| 0|1565916660|
+---------+--------+-------+------+----------+
- 创建您想要的时间序列
# Create time sequece needed
start = datetime.strptime('2019-08-15 20:30:00', '%Y-%m-%d %H:%M:%S')
eastern = timezone('US/Eastern')
start = eastern.localize(start)
times = pd.date_range(start = start, periods = 6, freq='5min')
times = [s.timestamp() for s in times]
print(times)
[1565915400.0, 1565915700.0, 1565916000.0, 1565916300.0, 1565916600.0, 1565916900.0]
- 最后,为每个组创建数据框
# Use pandas_udf to create final DataFrame
schm = StructType(df.schema.fields + [StructField('epoch', IntegerType(), True)])
@pandas_udf(schm, PandasUDFType.GROUPED_MAP)
def resample(pdf):
pddf = pd.DataFrame({'epoch':times})
pddf['school_id'] = pdf['school_id'][0]
pddf['class_id'] = pdf['class_id'][0]
pddf['user_id'] = pdf['user_id'][0]
res = np.searchsorted(times, pdf['start'])
arr = np.zeros(len(times))
arr[:] = np.nan
arr[res] = pdf['start']
pddf['status'] = arr
arr[:] = np.nan
arr[res] = pdf['status']
pddf['start'] = arr
return pddf
df = df.groupBy('school_id', 'class_id', 'user_id').apply(resample)
df = df.withColumn('timestamp', F.to_timestamp(df['epoch']))
df.show(60)
最终结果:
+---------+--------+-------+----------+-----+----------+-------------------+
|school_id|class_id|user_id| status|start| epoch| timestamp|
+---------+--------+-------+----------+-----+----------+-------------------+
| remote| red| 16| null| null|1565915400|2019-08-15 20:30:00|
| remote| red| 16| null| null|1565915700|2019-08-15 20:35:00|
| remote| red| 16|1565915940| 1|1565916000|2019-08-15 20:40:00|
| remote| red| 16| null| null|1565916300|2019-08-15 20:45:00|
| remote| red| 16|1565916540| 1|1565916600|2019-08-15 20:50:00|
| remote| red| 16| null| null|1565916900|2019-08-15 20:55:00|
| onsite| green| 15| null| null|1565915400|2019-08-15 20:30:00|
| onsite| green| 15| null| null|1565915700|2019-08-15 20:35:00|
| onsite| green| 15| null| null|1565916000|2019-08-15 20:40:00|
| onsite| green| 15|1565916060| 0|1565916300|2019-08-15 20:45:00|
| onsite| green| 15| null| null|1565916600|2019-08-15 20:50:00|
| onsite| green| 15| null| null|1565916900|2019-08-15 20:55:00|
| remote| green| 17| null| null|1565915400|2019-08-15 20:30:00|
| remote| green| 17| null| null|1565915700|2019-08-15 20:35:00|
| remote| green| 17| null| null|1565916000|2019-08-15 20:40:00|
| remote| green| 17| null| null|1565916300|2019-08-15 20:45:00|
| remote| green| 17| null| null|1565916600|2019-08-15 20:50:00|
| remote| green| 17|1565916660| 0|1565916900|2019-08-15 20:55:00|
| onsite| green| 17| null| null|1565915400|2019-08-15 20:30:00|
| onsite| green| 17| null| null|1565915700|2019-08-15 20:35:00|
| onsite| green| 17| null| null|1565916000|2019-08-15 20:40:00|
| onsite| green| 17|1565916180| 1|1565916300|2019-08-15 20:45:00|
| onsite| green| 17|1565916420| 1|1565916600|2019-08-15 20:50:00|
| onsite| green| 17| null| null|1565916900|2019-08-15 20:55:00|
| remote| green| 15| null| null|1565915400|2019-08-15 20:30:00|
| remote| green| 15|1565915580| 0|1565915700|2019-08-15 20:35:00|
| remote| green| 15| null| null|1565916000|2019-08-15 20:40:00|
| remote| green| 15| null| null|1565916300|2019-08-15 20:45:00|
| remote| green| 15| null| null|1565916600|2019-08-15 20:50:00|
| remote| green| 15| null| null|1565916900|2019-08-15 20:55:00|
+---------+--------+-------+----------+-----+----------+-------------------+
现在,您将获得每组 6 个时间戳。
请注意,并非所有原始的“状态”和“开始”都映射到最终的 DataFrame,这是因为在resample
udf,它发生在5minute
间隔,两个“开始”时间可以映射到同一时间网格点,您在这里会丢失一个。这可以在udf
根据您的频率以及您希望如何保存数据。