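I'm looking for a faster way to solve the following problem: a DataFrame has two columns that contain some `NaN` values, and the challenge is to replace those `NaN`s with values from a secondary DataFrame.
Below I share the data and code used to implement my approach. Let me explain the scenario: `merged_df` is the original DataFrame with several columns, some of which have rows with `NaN` values: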
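As the image above shows, the columns `day_of_week` and `holiday_flg` are of particular interest. I would like to fill the `NaN` values in these columns by looking them up in a second DataFrame called `date_info_df`, which looks like this: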
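Using the values in the `visit_date` column of `merged_df`, it is possible to search the `calendar_date` column of the second DataFrame and find the matching row. That match provides the values of `day_of_week` and `holiday_flg` from the second DataFrame.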
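To make the lookup concrete, here is a minimal sketch for a single date. It uses a three-row stand-in for `date_info_df` with values copied from the sample data further down; the names `calendar` and `match` are only illustrative, and it assumes every `calendar_date` appears exactly once.

```python
import pandas as pd

# Small stand-in for date_info_df (values taken from the sample data in this post)
date_info_df = pd.DataFrame({
    'calendar_date': pd.to_datetime(['2017-05-25', '2017-05-26', '2017-05-27']),
    'day_of_week': ['Thursday', 'Friday', 'Saturday'],
    'holiday_flg': [0, 0, 1],
})

# Index the calendar by date so a visit_date can be looked up directly.
# Assumption: each calendar_date is unique, so .loc returns a single row.
calendar = date_info_df.set_index('calendar_date')

visit_date = pd.Timestamp('2017-05-27')
match = calendar.loc[visit_date]       # the row whose calendar_date equals visit_date
weekday = match['day_of_week']
holiday = int(match['holiday_flg'])
print(weekday, holiday)
```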
The end result of this exercise is a DataFrame that looks like this:
You'll notice that my approach relies on `apply()` to run a custom function on every row of `merged_df`:
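- For every row, look for `NaN` values in `day_of_week` and `holiday_flg`;
- When a `NaN` is found in either or both of these columns, use the date available in that row's `visit_date` to find the matching row in the second DataFrame, specifically in the `date_info_df['calendar_date']` column;
- After a successful match, the value from `date_info_df['day_of_week']` must be copied into `merged_df['day_of_week']`, and the value from `date_info_df['holiday_flg']` must be copied into `merged_df['holiday_flg']`.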
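For comparison, the steps above can also be written as a column-wise lookup instead of a per-row function. This is only a sketch on a trimmed copy of the sample data, not something I have measured on the full dataset. It assumes each `calendar_date` appears exactly once in `date_info_df` and that every `visit_date` has a match; the names `calendar`, `visit_dates` and `looked_up` are illustrative.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the sample frames (only the columns involved in the fix)
merged_df = pd.DataFrame({
    'visit_date': ['2017-05-23', '2017-05-24', '2017-05-25', '2017-05-27'],
    'day_of_week': ['Tuesday', 'Wednesday', np.nan, np.nan],
    'holiday_flg': [0, np.nan, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    'calendar_date': pd.to_datetime(['2017-05-23', '2017-05-24', '2017-05-25',
                                     '2017-05-26', '2017-05-27', '2017-05-28']),
    'day_of_week': ['Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'holiday_flg': [0, 0, 0, 0, 1, 1],
})

# Assumption: calendar_date is unique, so it can serve as a lookup index
calendar = date_info_df.set_index('calendar_date')

# Parse visit_date so it has the same dtype as the calendar index
visit_dates = pd.to_datetime(merged_df['visit_date'])

for col in ['day_of_week', 'holiday_flg']:
    # Look up the calendar value for every row at once ...
    looked_up = visit_dates.map(calendar[col])
    # ... and use it only where the original value is NaN
    merged_df[col] = merged_df[col].fillna(looked_up)

# Assumption: every visit_date was found, so no NaN is left before the cast
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)
print(merged_df)
```

If a `visit_date` had no match in `date_info_df`, the `NaN` would remain and the final `astype(int)` would raise, so that case would need separate handling.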
Here is the working source code:
import math
import pandas as pd
import numpy as np
from IPython.display import display
### Data for df
data = {
    'air_store_id': ['air_a1', 'air_a2', 'air_a3', 'air_a4'],
    'area_name': ['Tokyo', np.nan, np.nan, np.nan],
    'genre_name': ['Japanese', np.nan, np.nan, np.nan],
    'hpg_store_id': ['hpg_h1', np.nan, np.nan, np.nan],
    'latitude': [1234, np.nan, np.nan, np.nan],
    'longitude': [5678, np.nan, np.nan, np.nan],
    'reserve_datetime': ['2017-04-22 11:00:00', np.nan, np.nan, np.nan],
    'reserve_visitors': [25, 35, 45, np.nan],
    'visit_datetime': ['2017-05-23 12:00:00', np.nan, np.nan, np.nan],
    'visit_date': ['2017-05-23', '2017-05-24', '2017-05-25', '2017-05-27'],
    'day_of_week': ['Tuesday', 'Wednesday', np.nan, np.nan],
    'holiday_flg': [0, np.nan, np.nan, np.nan],
}
merged_df = pd.DataFrame(data)
display(merged_df)
### Data for date_info_df
data = {
    'calendar_date': ['2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28'],
    'day_of_week': ['Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'holiday_flg': [0, 0, 0, 0, 1, 1],
}
date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
display(date_info_df)
# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']
    holiday = row['holiday_flg']

    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']
        #print(' --> weekday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        weekday = date_info_df.at[idx, 'day_of_week']
        #print(' --> weekday search_date=', search_date, 'is', weekday)
        row['day_of_week'] = weekday

    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']
        #print(' --> holiday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        holiday = date_info_df.at[idx, 'holiday_flg']
        #print(' --> holiday search_date=', search_date, 'is', holiday)
        row['holiday_flg'] = int(holiday)

    return row

# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1)
# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)
display(merged_df)
I took some measurements so you can understand the struggle:
- On a DataFrame with 6 rows, `apply()` takes 3.01 ms;
- On a DataFrame with ~250000 rows, `apply()` takes 2min 51s;
- On a DataFrame with ~1215000 rows, `apply()` takes 4min 2s.
How can I improve the performance of this task?