如何使用正则表达式合并数据框的多列?

2024-03-12

我有一个df如下

import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

我想要做的就是合并这些列,我们有两个规则:

  1. 如果一列删除_C{0~9} or _C{0~9}{0~9} or _C{0~9}{0~9}{0~9}等于另一列,这两列可以合并。

让我们来number_C1_E1 number_C2_E2 number_C3_E1举个例子,这里number_C1_E1 and number_C3_E1可以组合,因为它们都是number_E1 after removing _C{0~9}.

  1. 两个组合列应该去掉None values.

期望的结果是

  number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
0              1          apple            tom           None           None           None
1              2         banana          jerry           None           None           None
2              3      blueberry        anthony              3      blueberry        anthony
3              4     strawberry          terry           None           None           None
4              5     watermelon           paul           None           None           None
5              6          peach         edward           None           None           None
6              7         orange         reggie           None           None           None
7              8          lemon       nicholas           None           None           None

有人有好的解决办法吗?


使用与上一个问题相同的方法,但还要为您的列计算重命名器:

group = df.columns.str.replace(r'_C\d+', '', regex=True)

names = df.columns.to_series().groupby(group).first()

out = (df.groupby(group, axis=1, sort=False).first()
         .rename(columns=names)
       )

选择:

group = df.columns.str.replace(r'_C\d+', '', regex=True)

out = (df.groupby(group, axis=1, sort=False).first()
         .set_axis(df.columns[~group.duplicated()], axis=1)
       )

Output:

  number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
0            1        apple          tom         None         None         None
1            2       banana        jerry         None         None         None
2            3    blueberry      anthony            3    blueberry      anthony
3            4   strawberry        terry         None         None         None
4            5   watermelon         paul         None         None         None
5            6        peach       edward         None         None         None
6            7       orange       reggie         None         None         None
7            8        lemon     nicholas         None         None         None
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

如何使用正则表达式合并数据框的多列? 的相关文章

随机推荐