Fill NaN values from another DataFrame (with a different shape)

2024-03-05

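I am looking for a faster way to solve the following problem: a DataFrame has two columns that contain some NaN values, and the challenge is to replace these NaNs with values from a secondary DataFrame.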

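Below I share the data and the code that implements my approach. Let me explain the scenario: merged_df is the original DataFrame, with several columns, some of which have rows with NaN values: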

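The columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values in these columns by looking them up in a second DataFrame called date_info_df, which looks like this: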

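Using the values in the visit_date column of merged_df, it is possible to search the calendar_date column of the second DataFrame and find the matching row. This gives the values of day_of_week and holiday_flg from the second DataFrame.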

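The end result of this exercise is a DataFrame that looks like this:

You will notice that my approach relies on apply() to run a custom function on every row of merged_df: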

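  • For each row, look for NaN values in day_of_week and holiday_flg;
  • When a NaN is found in either or both of these columns, use the date in that row's visit_date to find the matching row in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
  • Once a match is found, the value from date_info_df['day_of_week'] must be copied to merged_df['day_of_week'], and the value from date_info_df['holiday_flg'] must be copied to merged_df['holiday_flg'].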

Here is the working source code:

import math
import pandas as pd
import numpy as np
from IPython.display import display

### Data for df
data = { 'air_store_id':     [              'air_a1',     'air_a2',     'air_a3',     'air_a4' ], 
         'area_name':        [               'Tokyo',       np.nan,       np.nan,       np.nan ], 
         'genre_name':       [            'Japanese',       np.nan,       np.nan,       np.nan ], 
         'hpg_store_id':     [              'hpg_h1',       np.nan,       np.nan,       np.nan ],          
         'latitude':         [                  1234,       np.nan,       np.nan,       np.nan ], 
         'longitude':        [                  5678,       np.nan,       np.nan,       np.nan ],         
         'reserve_datetime': [ '2017-04-22 11:00:00',       np.nan,       np.nan,       np.nan ], 
         'reserve_visitors': [                    25,           35,           45,       np.nan ], 
         'visit_datetime':   [ '2017-05-23 12:00:00',       np.nan,       np.nan,       np.nan ], 
         'visit_date':       [ '2017-05-23'         , '2017-05-24', '2017-05-25', '2017-05-27' ],
         'day_of_week':      [             'Tuesday',  'Wednesday',       np.nan,       np.nan ],
         'holiday_flg':      [                     0,       np.nan,       np.nan,       np.nan ]
       }

merged_df = pd.DataFrame(data)
display(merged_df)

### Data for date_info_df
data = { 'calendar_date':     [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ], 
         'day_of_week':       [    'Tuesday',  'Wednesday',   'Thursday',     'Friday',   'Saturday',     'Sunday' ], 
         'holiday_flg':       [            0,            0,            0,            0,            1,            1 ]         
       }

date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 
display(date_info_df)

# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']   
    holiday = row['holiday_flg']

    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']                               
        #print('  --> weekday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        weekday = date_info_df.at[idx,'day_of_week']
        #print('  --> weekday search_date=', search_date, 'is', weekday)        
        row['day_of_week'] = weekday        

    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']                               
        #print('  --> holiday search_date=', search_date, 'type=', type(search_date))        
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]                
        holiday = date_info_df.at[idx,'holiday_flg']
        #print('  --> holiday search_date=', search_date, 'is', holiday)        
        row['holiday_flg'] = int(holiday)

    return row


# send every row to fix_weekday_and_holiday
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1) 

# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)

display(merged_df)

I took some measurements so you can see the difficulty:

  • On a DataFrame with 6 rows, apply() takes 3.01 ms;
  • On a DataFrame with ~250000 rows, apply() takes 2min 51s;
  • On a DataFrame with ~1215000 rows, apply() takes 4min 2s.

How can I improve the performance of this task?


You can use an Index to speed up the lookup, and combine_first() to fill the NaNs:

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))

print(merged_df[cols])

Result:

  day_of_week  holiday_flg
0     Tuesday          0.0
1   Wednesday          0.0
2    Thursday          0.0
3    Saturday          1.0
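
As a complement to the answer above, here is a minimal sketch of the same lookup written with Series.map() and fillna() instead of .loc and combine_first(). It is not part of the original answer. The two frames are cut-down stand-ins for the ones in the question (same column names, fewer columns), and the variable name calendar is only illustrative. One difference worth knowing: as far as I can tell, .loc raises a KeyError when a visit_date is missing from the calendar, whereas map() leaves a NaN in that row.

```python
import numpy as np
import pandas as pd

# Cut-down stand-ins for the frames in the question (same column names).
merged_df = pd.DataFrame({
    "visit_date": ["2017-05-23", "2017-05-24", "2017-05-25", "2017-05-27"],
    "day_of_week": ["Tuesday", "Wednesday", np.nan, np.nan],
    "holiday_flg": [0, np.nan, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(
        ["2017-05-23", "2017-05-24", "2017-05-25",
         "2017-05-26", "2017-05-27", "2017-05-28"]),
    "day_of_week": ["Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday"],
    "holiday_flg": [0, 0, 0, 0, 1, 1],
})

# Index the calendar by date once, so lookups go through the index
# instead of scanning the column for every row.
calendar = date_info_df.set_index("calendar_date")
visit_date = pd.to_datetime(merged_df["visit_date"])

for col in ["day_of_week", "holiday_flg"]:
    # map() looks up each visit date in the calendar;
    # fillna() only overwrites rows that are currently NaN.
    merged_df[col] = merged_df[col].fillna(visit_date.map(calendar[col]))

# Casting to int is only safe once no NaN is left in the column.
merged_df["holiday_flg"] = merged_df["holiday_flg"].astype(int)
print(merged_df)
```

Both variants avoid calling a Python function per row, which is where the apply() version spends its time.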