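I'm not sure this fits exactly. It more or less does what you want, but it does not actually perform a merge. It follows the same idea as this question: https://stackoverflow.com/questions/33421551/how-to-merge-two-data-frames-based-on-nearest-date?lq=1, except that instead of subsetting df1 on a single column, here we use groupby, and we do so on both data frames. If you do want to include an explicit merge command and are happy with an inner join, check the very bottom of this answer, which contains a snippet.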
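Before the pandas code, the underlying idea can be sketched with the standard library alone: bucket the candidate dates by key, then pick the closest date within the matching bucket. The function name `nearest_in_group` and the sample rows are hypothetical and chosen only for illustration; this is a sketch of the idea, not the code used in this answer.

```python
from datetime import date

def nearest_in_group(rows, candidates):
    """For each (key, day) row, find the closest candidate day with the same key."""
    # Bucket the candidate dates by key, which mirrors a groupby.
    by_key = {}
    for key, day in candidates:
        by_key.setdefault(key, []).append(day)

    result = []
    for key, day in rows:
        pool = by_key.get(key)
        if not pool:
            # No group with this key on the other side: leave the row unmatched.
            result.append((key, day, None, None))
            continue
        best = min(pool, key=lambda other: abs(other - day))
        # The difference is "own date minus matched date".
        result.append((key, day, best, day - best))
    return result

# Hypothetical rows; each key is a (col1, col2, col3) tuple.
left = [
    (("432", "dsa12", "2"), date(2016, 5, 20)),
    (("432", "dsa12", "2"), date(2010, 6, 20)),
    (("1232", "asd", "1"), date(2010, 1, 23)),
]
right = [
    (("432", "dsa12", "2"), date(2016, 5, 23)),
    (("432", "dsa12", "2"), date(2010, 6, 10)),
]

matches = nearest_in_group(left, right)
for key, day, best, diff in matches:
    print(key, day, best, diff)
```

The pandas and scikit-learn version of the same idea follows.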
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def find_nearest(group, df2, groupname):
    # Fetch the rows of df2 that share this group's key;
    # if there are none, return the group unchanged.
    try:
        match = df2.groupby(groupname).get_group(group.name).copy()
    except KeyError:
        return group
    group = group.copy()  # avoid modifying a slice of the original frame
    # unit= only applies to numeric epochs, so it is omitted for date strings
    match['date'] = pd.to_datetime(match['date'])
    # n_neighbors must be passed by keyword in recent scikit-learn releases
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['date'].values[:, None])
    dist, ind = nbrs.kneighbors(group['date'].values[:, None])
    # keep the original date in date1 and store the matched date in date
    group['date1'] = group['date']
    group['date'] = match['date'].values[ind.ravel()]
    group['diff'] = group['date1'] - group['date']
    group['match_index'] = match.index[ind.ravel()]
    return group

# change dates from string to datetime
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# find closest dates and differences
# (group_keys=False keeps the group keys out of the result's index,
# so they remain ordinary columns only)
keys = ['col1', 'col2', 'col3']
df1_mod = df1.groupby(keys, group_keys=False).apply(find_nearest, df2, keys)
# fill unmatched dates
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])
df2_mod = df2.groupby(keys, group_keys=False).apply(find_nearest, df1, keys)
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])

# drop the matched-date column and restore each frame's original date
df1_mod.drop('date', inplace=True, axis=1)
df1_mod.rename(columns={'date1': 'date'}, inplace=True)
df2_mod.drop('date', inplace=True, axis=1)
df2_mod.rename(columns={'date1': 'date'}, inplace=True)
df2_mod['diff'] = -df2_mod['diff']

# drop redundant values: df2 rows that found a match are already covered by df1_mod
df2_mod.drop(df2_mod[df2_mod.match_index.str.len() > 0].index, inplace=True)

# merge the two
df_final = pd.merge(df1_mod, df2_mod, how='outer')
This produces the following result:
In [349]: df_final
Out[349]:
   col1   col2 col3       date     diff match_index
0  1232    asd    1 2010-01-23      NaT         NaN
1   432  dsa12    2 2016-05-20  -3 days          b2
2   432  dsa12    2 2010-06-20  10 days          b3
3   123   asd2    3 2008-10-21      NaT         NaN
4   132    asd    1 2010-01-23      NaT         NaN
5   123    sd2    3 2008-10-21      NaT         NaN
Using the merge command:
In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date     diff match_index
0  432  dsa12    2 2016-05-20  -3 days          b2
2  432  dsa12    2 2010-06-20  10 days          b3
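If an explicit merge is the goal, pandas also provides `pd.merge_asof`, which can match on the nearest date within groups via `by=` and `direction='nearest'`. This is a different technique from the groupby approach above, shown here only as a sketch. The `df2` below is hypothetical: it is inferred from the printed output, since the original frames are defined in the question rather than in this answer. The helper column `date_df2` is likewise introduced only for illustration.

```python
import pandas as pd

# df1 and df2 are hypothetical inputs reconstructed from the printed output.
df1 = pd.DataFrame({
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
}, index=['a1', 'a2', 'a3', 'a4'])
df2 = pd.DataFrame({
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21'],
}, index=['b1', 'b2', 'b3', 'b4'])
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

keys = ['col1', 'col2', 'col3']

# Expose df2's index as a column and keep a copy of its date,
# because the right-hand 'on' column is not retained in the result.
right = df2.rename_axis('match_index').reset_index()
right['date_df2'] = right['date']

# merge_asof requires both frames to be sorted by the 'on' column.
merged = pd.merge_asof(
    df1.sort_values('date'),
    right.sort_values('date'),
    on='date',
    by=keys,
    direction='nearest',
)
merged['diff'] = merged['date'] - merged['date_df2']
print(merged)
```

Note that `merge_asof` keeps every row of the left frame, so unmatched rows of df1 appear with missing values, while unmatched rows of df2 are not included.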
The case considered in the comments, namely:
df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '1432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
}).set_index('index')
produces the following result:
In [351]: df_final
Out[351]:
   col1   col2 col3       date     diff match_index
0  1232    asd    1 2010-01-23      NaT         NaN
1  1432  dsa12    2 2016-05-20      NaT         NaN
2   432  dsa12    2 2010-06-20  10 days          b3
3   123   asd2    3 2008-10-21      NaT         NaN
4   132    asd    1 2010-01-23      NaT         NaN
5   123    sd2    3 2008-10-21      NaT         NaN
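Row a2 is now unmatched because its key tuple no longer occurs in df2, so `get_group` raises `KeyError` and the group is returned unchanged. A quick way to see which rows have a counterpart group is sketched below. The `df2` keys are hypothetical (inferred from the printed output), and `has_group` is a name introduced only for this illustration.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

# Key columns of df1 as given in the comment's case.
df1 = pd.DataFrame({
    'col1': ['1232', '1432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
}, index=['a1', 'a2', 'a3', 'a4'])

# Hypothetical df2 keys, inferred from the printed output.
df2 = pd.DataFrame({
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
}, index=['b1', 'b2', 'b3', 'b4'])

# Flag each df1 row by whether its full key tuple occurs anywhere in df2.
df2_keys = set(df2[keys].itertuples(index=False, name=None))
has_group = pd.Series(
    [key in df2_keys for key in df1[keys].itertuples(index=False, name=None)],
    index=df1.index,
)
print(has_group)
```

Only rows flagged True can receive a `diff` and a `match_index`; all others pass through with missing values.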