+= 对不存在的数据帧进行操作

2024-02-12

df_对:

city1   city2
0   sfo yyz
1   sfo yvr
2   sfo dfw
3   sfo ewr

df_pairs.to_dict('records') 的输出:

[{'city1': 'sfo', 'city2': 'yyz'},
 {'city1': 'sfo', 'city2': 'yvr'},
 {'city1': 'sfo', 'city2': 'dfw'},
 {'city1': 'sfo', 'city2': 'ewr'}]

data_df:

    city    2016-02-02 00:00:00 2016-02-05 00:00:00 2016-02-01 00:00:00 2016-02-04 00:00:00 2016-02-03 00:00:00
0   sfo -33.63  -62.34  -35.70  -31.84  -33.87
1   yyz -24.31  -51.17  -22.07  -31.00  -23.00
2   yvr -24.31  -51.17  -22.07  -31.00  -23.00
3   dfw -32.17  -43.77  -34.84  0.27    -11.49
4   ewr -28.87  -59.66  -28.40  -32.94  -29.06

data_df.to_dict('records') 的输出

[{'city': 'sfo',
  Timestamp('2016-02-02 00:00:00'): -33.63,
  Timestamp('2016-02-05 00:00:00'): -62.34,
  Timestamp('2016-02-01 00:00:00'): -35.7,
  Timestamp('2016-02-04 00:00:00'): -31.84,
  Timestamp('2016-02-03 00:00:00'): -33.87},
 {'city': 'yyz',
  Timestamp('2016-02-02 00:00:00'): -24.31,
  Timestamp('2016-02-05 00:00:00'): -51.17,
  Timestamp('2016-02-01 00:00:00'): -22.07,
  Timestamp('2016-02-04 00:00:00'): -31.0,
  Timestamp('2016-02-03 00:00:00'): -23.0},
 {'city': 'yvr',
  Timestamp('2016-02-02 00:00:00'): -24.31,
  Timestamp('2016-02-05 00:00:00'): -51.17,
  Timestamp('2016-02-01 00:00:00'): -22.07,
  Timestamp('2016-02-04 00:00:00'): -31.0,
  Timestamp('2016-02-03 00:00:00'): -23.0},
 {'city': 'dfw',
  Timestamp('2016-02-02 00:00:00'): -32.17,
  Timestamp('2016-02-05 00:00:00'): -43.77,
  Timestamp('2016-02-01 00:00:00'): -34.84,
  Timestamp('2016-02-04 00:00:00'): 0.27,
  Timestamp('2016-02-03 00:00:00'): -11.49},
 {'city': 'ewr',
  Timestamp('2016-02-02 00:00:00'): -28.87,
  Timestamp('2016-02-05 00:00:00'): -59.66,
  Timestamp('2016-02-01 00:00:00'): -28.4,
  Timestamp('2016-02-04 00:00:00'): -32.94,
  Timestamp('2016-02-03 00:00:00'): -29.06}]

所以我有一个名为df_pairs。对于每一对df_pairs,我想在中查找 city1 和 city2data_df,从另一个中减去一个,取差时间序列的符号,分离正负符号值,分离正差值和负差值,并计算 data_df 列中每一列的总和。

diff_df_sign_pos = diff_df_sign_neg = diff_df_pos = diff_df_neg = 0

for i in range(0,len(data_df.columns)):
    a = pd.merge(df_pairs[['city1','city2']], data_df.ix[:, [i]], left_on='city1', right_index=True, how='left').set_index(['city1', 'city2'])
    b = pd.merge(df_pairs[['city1','city2']], data_df.ix[:, [i]], left_on='city2', right_index=True, how='left').set_index(['city1', 'city2'])
    diff_df = b - a
    diff_df_sign = np.sign(diff_df)
    diff_df_sign_pos+= diff_df_sign.clip(lower=0)
    diff_df_sign_neg+= diff_df_sign.clip(upper=0)
    diff_df_pos+= diff_df.clip(lower=0)
    diff_df_neg+= diff_df.clip(upper=0)

如果运行上面的代码,您将看到最终值diff_df_sign_pos, diff_df_sign_neg, diff_df_pos and diff_df_neg是 NaN。

例如,最终结果为diff_df_sign_pos应该看起来像:

               2016-02-03 00:00:00
city1    city2  
sfo      yyz    5.0
         yvr    5.0
         dfw    5.0
         ewr    4.0

这告诉我们 yyz、yvr、dfw 和 sfo 之间的所有 5 个差异都是正的。


你为什么不简单地这样做:

df_city1 = pd.merge(df_pairs['city1'], data_df, left_on='city1', right_on='city', how='left')
df_city2 = pd.merge(df_pairs['city2'], data_df, left_on='city2', right_on='city', how='left')
diff = df_city2.subtract(df_city1, fill_value=0)
pos_sum = diff[diff >= 0].sum(axis=1)
neg_sum = diff[diff <  0].sum(axis=1)

不要循环遍历所有列,而是合并 2*(列数)次,更不用说索引了,然后使用复杂的位np.sign and .clip... 你的df_pairs and data_df是一一对应的吧?

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

+= 对不存在的数据帧进行操作 的相关文章

随机推荐