Split a DataFrame based on a condition

2024-04-06

I am trying to split my DataFrame in two based on `medical_plan_id`. If it is empty, the row goes into `df1`. If it is not empty, it goes into `df2`.

df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]

The code below works, but if there are no empty fields it raises TypeError("invalid type comparison").

df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]

How should I handle a situation like this?

My `df_with_medicalplanid` looks like this:

wellthie_issuer_identifier       ...       medical_plan_id
0                   UHC99806       ...                  None
1                   UHC99806       ...                  None

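Use `==`, not `is`, to test equality

Likewise, use `!=` instead of `is not` for inequality.

`is` has a special meaning in Python. It returns `True` if two variables point to the same object, while `==` checks whether the objects the variables refer to are equal. See also: Is there a difference between `==` and `is` in Python? https://stackoverflow.com/questions/132988/is-there-a-difference-between-and-is-in-python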
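To make the difference concrete, here is a minimal sketch using a small, made-up Series (the values are illustrative, not the asker's data). It shows that an identity test on a Series yields one plain bool, whereas `!=` yields an element-wise Boolean Series that can be used as a mask.

```python
import pandas as pd

# Illustrative data, not taken from the question.
s = pd.Series(["", "A1", ""])
empty = ""

# Identity test: asks whether the Series object *is* the string object.
# It never looks at the elements, so the result is a single bool.
identity_result = s is not empty

# Equality test: compares every element with the empty string.
equality_result = s != empty

print(type(identity_result).__name__, identity_result)
print(equality_result.tolist())
```

Indexing a DataFrame with a single `True` does not filter rows, which is why the `is not` version in the question cannot work as a mask.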

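Don't repeat mask calculations

The Boolean masks you are creating are the most expensive part of your logic. It is also logic you want to avoid repeating manually, since your first and second masks are inverses of each other. You can therefore use the bitwise inverse operator https://stackoverflow.com/questions/8305199/the-tilde-operator-in-python `~` ("tilde"), also accessible via `operator.invert` https://docs.python.org/3/library/operator.html#operator.inv, to negate an existing mask.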
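As a rough sketch of that idea, the snippet below computes the mask once and reuses its negation. The DataFrame contents are made up for illustration.

```python
import operator

import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"medical_plan_id": ["", "A1", "", "B2"],
                   "values": [1, 2, 3, 4]})

# Compute the comparison a single time...
mask = df["medical_plan_id"] == ""

# ...then reuse it and its inverse.
df1 = df[mask]
df2 = df[~mask]

# operator.invert(mask) is the functional spelling of ~mask.
same = (~mask).equals(operator.invert(mask))

print(df1["values"].tolist())
print(df2["values"].tolist())
```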

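An empty string is different from a null value

Equality with an empty string can be tested via `== ''`, but testing for null values requires a specialized method: `pd.Series.isnull` https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html. This is because null values are represented in the NumPy arrays used by Pandas by `np.nan`, and `np.nan != np.nan` by design https://stackoverflow.com/a/1573715/9209546.

If you want to replace empty strings with null values, you can do so: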
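The following sketch illustrates the distinction on a small object-dtype Series (made-up values; the explicit `dtype=object` is an assumption chosen to mirror a column of strings and `None`, like the one in the question). Note that `isnull` flags both `None` and `np.nan`, while `== ""` flags neither.

```python
import numpy as np
import pandas as pd

# Illustrative data: one empty string, two kinds of null, one real value.
s = pd.Series(["", None, np.nan, "A1"], dtype=object)

is_empty_string = s == ""   # True only for the actual empty string
is_null = s.isnull()        # True for None and NaN

# NaN never compares equal to itself, so == cannot be used to find it.
nan_equals_itself = np.nan == np.nan

print(is_empty_string.tolist())
print(is_null.tolist())
print(nan_equals_itself)
```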

df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)

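Conceptually, it makes sense for missing values to be null (`np.nan`) rather than empty strings. But the reverse of the above process, i.e. converting null values to empty strings, is also possible: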

df['medical_plan_id'] = df['medical_plan_id'].fillna('')

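If the difference matters, you need to know your data and apply the appropriate logic.

Semi-final solution

Assuming you do have null values, calculate a single Boolean mask and its inverse: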
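A small sketch of both conversions side by side, on made-up data, may help show what each one does. After either conversion, empty strings and nulls become indistinguishable, which is why the choice depends on your data.

```python
import numpy as np
import pandas as pd

# Illustrative data: an empty string, a null, and a real value.
s = pd.Series(["", None, "A1"], dtype=object)

# Option 1: normalise empty strings to nulls, then test with isnull().
as_null = s.replace("", np.nan)

# Option 2: normalise nulls to empty strings, then test with == "".
as_empty = s.fillna("")

print(as_null.isnull().tolist())
print((as_empty == "").tolist())
```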

mask = df['medical_plan_id'].isnull()

df1 = df[mask]
df2 = df[~mask]
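If a column may contain both nulls and empty strings and you want to treat them alike, one possible approach is to combine the two tests into one mask. The helper below is a hypothetical sketch, not part of the original answer; the name `split_by_missing` and the sample data are assumptions made for illustration.

```python
import pandas as pd


def split_by_missing(df, column):
    """Hypothetical helper: return (missing_rows, present_rows) for `column`."""
    values = df[column]
    # Treat nulls (None/NaN) and empty strings as "missing".
    mask = values.isnull() | (values == "")
    return df[mask], df[~mask]


# Illustrative data, not taken from the question.
df = pd.DataFrame({"medical_plan_id": [None, "", "A1", "B2"],
                   "values": [1, 2, 3, 4]})

df1, df2 = split_by_missing(df, "medical_plan_id")
print(df1["values"].tolist())
print(df2["values"].tolist())
```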

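Final solution: avoid extra variables

As a programmer, you should look to avoid creating additional variables. In this case, there is no need to create two new variables: you can use `GroupBy` with `dict` to give a dictionary of DataFrames with `False` (== 0) and `True` (== 1) keys corresponding to your mask: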

dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))

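Then `dfs[0]` represents `df2` and `dfs[1]` represents `df1` (see also this related answer https://stackoverflow.com/a/52947460/9209546). As a variant of the above, you can forgo the dictionary construction and use Pandas `GroupBy` methods: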

dfs = df.groupby(df['medical_plan_id'].isnull())

dfs.get_group(0)  # equivalent to dfs[0] from dict solution
dfs.get_group(1)  # equivalent to dfs[1] from dict solution
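One caveat worth sketching: a group only exists if at least one row falls into it. If the column has no nulls at all (the situation described in the question), the dictionary has no `True` key and `dfs[1]` would raise a `KeyError`. A possible workaround, shown here on made-up data, is `dict.get` with an empty slice of the DataFrame as the fallback.

```python
import pandas as pd

# Illustrative data with no null values in the column.
df = pd.DataFrame({"medical_plan_id": ["A1", "B2"],
                   "values": [1, 2]})

groups = dict(tuple(df.groupby(df["medical_plan_id"].isnull())))

# An empty frame with the same columns, used when a group is absent.
empty = df.iloc[0:0]

with_null = groups.get(True, empty)      # no rows: the True group is missing
without_null = groups.get(False, empty)  # all rows

print(len(groups), len(with_null), len(without_null))
```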

Example

Putting all the above into practice:

import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})

df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))

print(dfs[0], dfs[1], sep='\n'*2)

   medical_plan_id  values
2           2134.0       3
3           4325.0       4
4           6543.0       5

   medical_plan_id  values
0              NaN       1
1              NaN       2
5              NaN       6
6              NaN       7


python

pandas