Default/fill value for outer joins

2024-04-08

Below are tiny/toy versions of the much larger and more complex dataframes I'm working with:

>>> A
  key         u         v         w         x
0   a  0.757954  0.258917  0.404934  0.303313
1   b  0.583382  0.504687       NaN  0.618369
2   c       NaN  0.982785  0.902166       NaN
3   d  0.898838  0.472143       NaN  0.610887
4   e  0.966606  0.865310       NaN  0.548699
5   f       NaN  0.398824  0.668153       NaN

>>> B
  key         y         z
0   a  0.867603       NaN
1   b       NaN  0.191067
2   c  0.238616  0.803179
3   p  0.080446       NaN
4   q  0.932834       NaN
5   r  0.706561  0.814467

(FWIW, at the end of this post I provide the code that generates these dataframes.)

I want to produce an outer join of these dataframes on the key column¹, in such a way that the new positions induced by the outer join get the default value 0.0. IOW, the desired result looks like this:

  key         u         v         w         x         y         z
0   a  0.757954  0.258917  0.404934  0.303313  0.867603       NaN
1   b  0.583382  0.504687       NaN  0.618369       NaN  0.191067
2   c       NaN  0.982785  0.902166       NaN  0.238616  0.803179
3   d  0.898838  0.472143       NaN  0.610887  0.000000  0.000000
4   e  0.966606  0.865310       NaN  0.548699  0.000000  0.000000
5   f       NaN  0.398824  0.668153       NaN  0.000000  0.000000
6   p  0.000000  0.000000  0.000000  0.000000  0.080446       NaN
7   q  0.000000  0.000000  0.000000  0.000000  0.932834       NaN
8   r  0.000000  0.000000  0.000000  0.000000  0.706561  0.814467

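(Note that this desired output contains some NaNs, namely those that were already present in A or B.)

The merge method gets me partway there, but the filled-in default values are NaN, not 0.0: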

>>> C = pandas.DataFrame.merge(A, B, how='outer', on='key')
>>> C
  key         u         v         w         x         y         z
0   a  0.757954  0.258917  0.404934  0.303313  0.867603       NaN
1   b  0.583382  0.504687       NaN  0.618369       NaN  0.191067
2   c       NaN  0.982785  0.902166       NaN  0.238616  0.803179
3   d  0.898838  0.472143       NaN  0.610887       NaN       NaN
4   e  0.966606  0.865310       NaN  0.548699       NaN       NaN
5   f       NaN  0.398824  0.668153       NaN       NaN       NaN
6   p       NaN       NaN       NaN       NaN  0.080446       NaN
7   q       NaN       NaN       NaN       NaN  0.932834       NaN
8   r       NaN       NaN       NaN       NaN  0.706561  0.814467

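The fillna method does not produce the desired output, because it also modifies positions that should be left unchanged (the NaNs that were already present in A or B):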

>>> C.fillna(0.0)
  key         u         v         w         x         y         z
0   a  0.757954  0.258917  0.404934  0.303313  0.867603  0.000000
1   b  0.583382  0.504687  0.000000  0.618369  0.000000  0.191067
2   c  0.000000  0.982785  0.902166  0.000000  0.238616  0.803179
3   d  0.898838  0.472143  0.000000  0.610887  0.000000  0.000000
4   e  0.966606  0.865310  0.000000  0.548699  0.000000  0.000000
5   f  0.000000  0.398824  0.668153  0.000000  0.000000  0.000000
6   p  0.000000  0.000000  0.000000  0.000000  0.080446  0.000000
7   q  0.000000  0.000000  0.000000  0.000000  0.932834  0.000000
8   r  0.000000  0.000000  0.000000  0.000000  0.706561  0.814467

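How can I achieve the desired output efficiently? (Performance matters here, since I intend to perform this operation on dataframes much larger than the ones shown here.)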


FWIW, below is the code that generates the example dataframes A and B.

from pandas import DataFrame
from collections import OrderedDict
from random import random, seed

def make_dataframe(rows, colnames):
    return DataFrame(OrderedDict([(n, [row[i] for row in rows])
                                 for i, n in enumerate(colnames)]))

maybe_nan = lambda: float('nan') if random() < 0.4 else random()

seed(0)

A = make_dataframe([['a', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['b', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['c', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['d', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['e', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['f', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()]],
                   ('key', 'u', 'v', 'w', 'x'))

B = make_dataframe([['a', maybe_nan(), maybe_nan()],
                    ['b', maybe_nan(), maybe_nan()],
                    ['c', maybe_nan(), maybe_nan()],
                    ['p', maybe_nan(), maybe_nan()],
                    ['q', maybe_nan(), maybe_nan()],
                    ['r', maybe_nan(), maybe_nan()]],
                   ('key', 'y', 'z'))

¹ For the case of multi-key outer joins, see https://stackoverflow.com/q/39751636/559827.


You can fill in the zeros after the merge:

import pandas as pd
res = pd.merge(A, B, how="outer")
# Zero out A's columns in the rows whose key does not occur in A.
res.loc[~res.key.isin(A.key), A.columns] = 0

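EDIT

A.columns includes the key column, so the assignment above also overwrites the key with 0 in those rows. To skip the key column: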

res.loc[~res.key.isin(A.key), A.columns.drop("key")] = 0
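
The snippet above only zeroes A's columns in the rows that come from B alone; the desired output in the question also needs B's columns zeroed in the rows that come from A alone. As a rough sketch of one way to cover both sides, the merge can be asked to label each row's origin with indicator=True, which adds a "_merge" column. The small frames and the names a_cols and b_cols below are made up for illustration, and the sketch assumes both frames share a single key column named "key" and have no other column names in common.

```python
import numpy as np
import pandas as pd

# Small stand-in frames (hypothetical values) that already contain some NaNs.
A = pd.DataFrame({"key": ["a", "b", "d"],
                  "u": [0.1, np.nan, 0.3],
                  "v": [np.nan, 0.5, 0.6]})
B = pd.DataFrame({"key": ["a", "b", "p"],
                  "y": [0.7, np.nan, 0.9],
                  "z": [np.nan, 0.2, np.nan]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only" or "both".
res = pd.merge(A, B, how="outer", on="key", indicator=True)

# Value columns of each input (everything except the join key).
a_cols = A.columns.drop("key")
b_cols = B.columns.drop("key")

# Rows found only in B have no A values, and vice versa: fill just those.
res.loc[res["_merge"] == "right_only", a_cols] = 0.0
res.loc[res["_merge"] == "left_only", b_cols] = 0.0

# The helper column is no longer needed.
res = res.drop(columns="_merge")
print(res)
```

NaNs that were already present in A or B sit in rows or columns that the two masks never touch, so they are left alone. Whether this is faster than the isin version on large frames is something I have not measured.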