Using loc with MultiIndex DataFrames in Pandas

2023-10-19

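In a previous tutorial, we discussed the loc property, a label-based way to select data.

But did you know that you can refine your selections when using loc with a MultiIndex?

This tutorial shows how to use loc with MultiIndex DataFrames in Pandas.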

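Table of Contents

1 Basic Usage of loc with MultiIndex
2 Select All Rows from Multiple Levels
3 Select Data Based on Values
4 Select Data from a Sorted MultiIndex DataFrame
5 Apply Boolean Conditions on Index Levels
6 Assign Values to Specific Indexes Using MultiIndex
7 Select Rows and Columns from a MultiIndex DataFrame
8 Complex Data Selection with Multiple Conditions
9 Handle Missing Values in a MultiIndex DataFrame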

Basic Usage of loc with MultiIndex

Let's start by importing the necessary libraries and creating some sample data.


import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(0)

# Create a MultiIndex DataFrame
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index)
df.columns = ['A', 'B']
print(df)

Output:


            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

Let's see how loc works with a MultiIndex DataFrame:


# Accessing a single row with loc
print(df.loc[(1, 3)])

# Accessing multiple rows
print(df.loc[[(1, 3), (2, 2)]])

Output:


A    0.020218
B    0.832620
Name: (1, 3), dtype: float64

            A         B
1 3  0.020218  0.832620
2 2  0.118274  0.639921

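In the first example, we index the DataFrame with a tuple: (1, 3).

This asks for the row whose first-level label is 1 and whose second-level label is 3. The output is a Series containing that row's data.

In the second example, we pass loc a list of tuples, each identifying a different row of the DataFrame. The result is a DataFrame.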
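As a small sketch of the difference between the two call styles, the snippet below rebuilds the same sample data and checks what each call returns. Wrapping a single tuple in a list is one way to keep the result as a DataFrame; the variable names row_as_series and row_as_frame are illustrative and not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# A single tuple returns the row as a Series
row_as_series = df.loc[(1, 3)]

# A list holding one tuple returns a one-row DataFrame
row_as_frame = df.loc[[(1, 3)]]

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__, row_as_frame.shape)
```

Keeping the DataFrame shape can be handy when the result is passed to code that expects two-dimensional input.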

Select All Rows from Multiple Levels

Let's see how to select rows by specifying more than one index level:


# Select the row with first-level index 2 and second-level index 1
print(df.loc[(2, 1),])

Output:


A    0.461479
B    0.780529
Name: (2, 1), dtype: float64

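When we pass (2, 1) as the key to loc, it returns the data for the matching row. The trailing comma separates the row key from the column key, which is omitted here, so all columns are returned.

So with (2, 1), we select the row whose first-level index is 2 and whose second-level index is 1. Because both levels are specified, the result is a single row, returned as a Series.

Let's take this a step further and use slicing.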
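To make the role of the key explicit, here is a minimal sketch that compares a full key with a partial key on the same sample data. It writes the column part as an explicit `:` rather than a bare trailing comma; the names full_key, same_row and partial are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Full key plus an explicit "all columns" slice: one row, returned as a Series
full_key = df.loc[(2, 1), :]

# The same row without the column part
same_row = df.loc[(2, 1)]

# Partial key: only the first level is given, so every row under 2 is returned
partial = df.loc[2]

print(full_key.equals(same_row))
print(partial.shape)
```

With the partial key, the result is a DataFrame indexed by the remaining second level.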


# Select all rows where level 1 index is between 2 and 4
print(df.loc[(slice(2, 4), slice(None)), ])

Output:


            A         B
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

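Here we use slice objects to define the selection range. This returns the rows whose first-level index is between 2 and 4, inclusive.

In addition, slice(None) tells Pandas to select every value in the second level.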
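The same selection can also be written with pd.IndexSlice, which lets you use the familiar colon syntax instead of slice objects. This is an alternative notation rather than something the original tutorial uses, so treat it as an optional sketch.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Selection written with explicit slice objects
with_slices = df.loc[(slice(2, 4), slice(None)), :]

# Equivalent selection written with pd.IndexSlice
idx = pd.IndexSlice
with_index_slice = df.loc[idx[2:4, :], :]

print(with_slices.equals(with_index_slice))
print(with_index_slice.shape)
```

Both forms are label-based, so the end of the range (4) is included.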

Select Data Based on Values

What if you want to select data based on values?

Let's see how to do that:


# Select rows where A > 0.5
df_A = df.loc[df['A'] > 0.5]
print(df_A)

Output:


            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  4  0.778157  0.870012
2 0  0.978618  0.799159
  4  0.521848  0.414662
3 3  0.612096  0.616934
  4  0.943748  0.681820
4 1  0.697631  0.060225
  2  0.666767  0.670638

The resulting DataFrame, df_A, includes all rows where the value in column A is greater than 0.5.

We can also combine multiple conditions:


# Loc to select rows where A > 0.5 and B < 0.3
df_AB = df.loc[(df['A'] > 0.5) & (df['B'] < 0.3)]
print(df_AB)

Output:


            A         B
4 1  0.697631  0.060225

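Adding another condition narrows the selection. Now df_AB includes only the rows where column A > 0.5 and column B < 0.3.

You can combine multiple conditions with the & (and) or | (or) operators. Each condition must be wrapped in parentheses, because & and | bind more tightly than the comparison operators.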
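As a short sketch of the operators side by side, the snippet below builds the two conditions once and combines them with &, | and ~ (negation, which the tutorial does not cover but which follows the same pattern). The names cond_a, cond_b, both, either and neither are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

cond_a = df['A'] > 0.5
cond_b = df['B'] < 0.3

both = df.loc[cond_a & cond_b]         # rows matching both conditions
either = df.loc[cond_a | cond_b]       # rows matching at least one condition
neither = df.loc[~(cond_a | cond_b)]   # rows matching neither condition

print(len(both), len(either), len(neither))
```

Storing each condition in a variable first keeps the final selection readable and avoids missing parentheses.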

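Select Data from a Sorted MultiIndex DataFrame

Label-based slicing on a MultiIndex requires the index to be sorted. If it is not, you may get unexpected results or an error such as UnsortedIndexError. Here is how to sort a MultiIndex DataFrame: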


# Sort the DataFrame
df_sorted = df.sort_index()
print(df_sorted)

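The DataFrame df_sorted is sorted by its index. Because the sample index was already built in sorted order, its contents are identical to df.

Once the DataFrame is sorted, you can use loc for label-based slicing: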


# Select all rows where the first level index is between 2 and 4
df_sorted_range = df_sorted.loc[(slice(2, 4), slice(None)), ]
print(df_sorted_range)

Output:


            A         B
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

In this case, df_sorted_range consists of the rows whose first-level index is between 2 and 4, inclusive, just as in the earlier example.
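Since the tutorial's sample index is already sorted, the effect of sorting is hard to see there. The sketch below uses a small hypothetical frame, df_unsorted, whose index is deliberately out of order, to show what slicing does before and after sort_index(). The exact exception may vary between pandas versions; it is typically UnsortedIndexError, a subclass of KeyError, so the code catches KeyError.

```python
import pandas as pd

# Hypothetical small frame whose index is deliberately out of order
unsorted_index = pd.MultiIndex.from_tuples([(2, 'a'), (1, 'b'), (3, 'a'), (1, 'a')])
df_unsorted = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]}, index=unsorted_index)

# Check whether the index is sorted before slicing
print(df_unsorted.index.is_monotonic_increasing)

# Slicing the unsorted index is expected to fail
error_name = None
try:
    df_unsorted.loc[(slice(1, 2), slice(None)), :]
except KeyError as err:
    error_name = type(err).__name__
print(error_name)

# After sorting, the same slice works
df_fixed = df_unsorted.sort_index()
sliced = df_fixed.loc[(slice(1, 2), slice(None)), :]
print(sliced)
```

Checking is_monotonic_increasing first is a cheap way to decide whether a sort_index() call is needed.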

Apply Boolean Conditions on Index Levels

Building on the sorted DataFrame above, let's see how to apply boolean conditions to the index levels:


# Select all rows with first-level index > 1 and second-level index < 3
df_bool_index = df_sorted.loc[(df_sorted.index.get_level_values(0) > 1) & 
                              (df_sorted.index.get_level_values(1) < 3)]
print(df_bool_index)

Output:


            A         B
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638

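In this example, get_level_values(0) and get_level_values(1) return the values of the first and second index levels, respectively.

The resulting df_bool_index contains the rows whose first-level index is greater than 1 and whose second-level index is less than 3.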
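Comparisons are not the only conditions you can build from level values. As an illustrative sketch, the snippet below uses isin() on the second level to keep only the rows whose second-level index is 0 or 4; the names second_level and edge_rows are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Values of the second index level, one per row
second_level = df.index.get_level_values(1)

# Boolean mask: True where the second-level label is 0 or 4
edge_rows = df.loc[second_level.isin([0, 4])]

print(edge_rows)
```

Unlike label slicing, a boolean mask like this does not require the index to be sorted.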

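Assign Values to Specific Indexes Using MultiIndex

You can use loc to set values based on a condition, or to target specific index entries precisely. The examples below work on a copy, so df_sorted keeps its original values for the later sections:


# Work on a copy so that df_sorted stays unchanged
df_assigned = df_sorted.copy()

# Assign a value for a specific index
df_assigned.loc[(1, 3), 'A'] = 0.999
print(df_assigned.loc[(1, 3)])

Output:


A    0.99900
B    0.83262
Name: (1, 3), dtype: float64

We use loc to set the value of column 'A' at index (1, 3) to 0.999.

You can also assign a value to a range of index levels, like this:


# Assign a value for a range of the first index level
df_assigned.loc[(slice(2, 4), slice(None)), 'A'] = 0.123
print(df_assigned.loc[(slice(2, 4), slice(None)), ])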

Output:


         A         B
2 0  0.123  0.799159
  1  0.123  0.780529
  2  0.123  0.639921
  3  0.123  0.944669
  4  0.123  0.414662
3 0  0.123  0.774234
  1  0.123  0.568434
  2  0.123  0.617635
  3  0.123  0.616934
  4  0.123  0.681820
4 0  0.123  0.437032
  1  0.123  0.060225
  2  0.123  0.670638
  3  0.123  0.128926
  4  0.123  0.363711

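In this example, we set every value of column 'A' to 0.123 for the rows whose first-level index is between 2 and 4.

Each A value in that index range now shows the newly assigned value.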
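This section also mentions setting values based on a condition, which the examples above do not show. A minimal sketch of that pattern follows; it assigns on a copy named df_cond (an illustrative name) so the sample data stays intact, and the threshold and replacement value are arbitrary.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Work on a copy so the original values are preserved
df_cond = df.copy()

# Boolean mask selecting the rows to change
mask = df_cond['A'] > 0.5

# Set column 'B' to 0.0 only in the rows where the mask is True
df_cond.loc[mask, 'B'] = 0.0

print(df_cond.loc[mask])
```

Passing the mask and the column label to a single loc call performs the assignment in one step and avoids chained indexing.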

Select Rows and Columns from a MultiIndex DataFrame

You can use loc to select rows and columns from a MultiIndex DataFrame at the same time.


# Select specific rows and columns
df_row_col = df_sorted.loc[(slice(None), slice(1, 3)), 'B']
print(df_row_col)

Output:


0  1    0.544883
   2    0.645894
   3    0.891773
1  1    0.925597
   2    0.087129
   3    0.832620
2  1    0.780529
   2    0.639921
   3    0.944669
3  1    0.568434
   2    0.617635
   3    0.616934
4  1    0.060225
   2    0.670638
   3    0.128926
Name: B, dtype: float64

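For the row selection, slice(None) selects all values in the first index level, and slice(1, 3) selects the rows whose second-level index is between 1 and 3, inclusive.

We also select column 'B', which produces a Series.

You can select multiple columns like this: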


# Select all rows for multiple columns
df_multi_col = df_sorted.loc[(slice(None), slice(None)), ['A', 'B']]
print(df_multi_col)

Output:


            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

In this example, we select all rows from both index levels with slice(None), and for the column selection we pass a list of column labels, ['A', 'B'].
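The shape of the result depends on how the column is specified. As a quick sketch, the snippet below applies the same row selection twice, once with a single label and once with a one-element list; the names rows, as_series and as_frame are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Row selection: every first-level value, second-level values 1 through 3
rows = (slice(None), slice(1, 3))

# A single column label returns a Series
as_series = df.loc[rows, 'B']

# A list of labels returns a DataFrame, even with one column
as_frame = df.loc[rows, ['B']]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```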

Complex Data Selection with Multiple Conditions

You can combine multiple conditions to select data from a MultiIndex DataFrame:


# Select rows where A > 0.5 in level 1 index 2 and B < 0.3 in level 1 index 3
df_complex = df_sorted.loc[((df_sorted.index.get_level_values(0) == 2) & (df_sorted['A'] > 0.5)) |
                           ((df_sorted.index.get_level_values(0) == 3) & (df_sorted['B'] < 0.3))]
print(df_complex)

Output:


            A         B
2 0  0.978618  0.799159
  4  0.521848  0.414662

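Here we combine the two conditions with the pipe character |, which means "or".

The & inside each pair of parentheses means "and".

So we are looking for rows whose first-level index is 2 and whose value in column A is greater than 0.5, or rows whose first-level index is 3 and whose value in column B is less than 0.3. No row under first-level index 3 has B below 0.3, so only the two matching rows under index 2 are returned.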
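Long boolean expressions are easier to read and debug when each part has a name. The sketch below expresses the same selection with intermediate masks on the sample data; level0, in_group_2 and in_group_3 are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Values of the first index level, one per row
level0 = df.index.get_level_values(0)

# One named mask per branch of the "or"
in_group_2 = (level0 == 2) & (df['A'] > 0.5)
in_group_3 = (level0 == 3) & (df['B'] < 0.3)

df_complex = df.loc[in_group_2 | in_group_3]

print(in_group_2.sum(), in_group_3.sum())
print(df_complex)
```

Printing the sum of each mask shows how many rows each branch contributes, which helps explain why a branch returns nothing.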

Handle Missing Values in a MultiIndex DataFrame

Let's first create a DataFrame that contains missing values.


df_missing = df_sorted.copy()
df_missing.loc[(2, 3), 'A'] = np.nan
df_missing.loc[(4, 0), 'B'] = np.nan
print(df_missing)

Output:


            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3       NaN  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508       NaN
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

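In df_missing above, we deliberately set a missing value in column 'A' at index (2, 3) and in column 'B' at index (4, 0).

You can use notna() to select non-missing data, or isna() to select missing data: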


# Select rows where 'A' is not missing
df_A_not_missing = df_missing.loc[df_missing['A'].notna()]
print(df_A_not_missing)

Output:


            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508       NaN
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

The DataFrame df_A_not_missing consists of all rows of the original DataFrame in which A is not missing.
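The opposite selection works the same way with isna(). As a minimal sketch, the snippet below recreates the missing values on a copy of the sample data and selects the rows where A is missing, then the rows where any column is missing; df_A_missing and df_any_missing are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data
np.random.seed(0)
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index, columns=['A', 'B'])

# Introduce the same two missing values as in the tutorial
df_missing = df.copy()
df_missing.loc[(2, 3), 'A'] = np.nan
df_missing.loc[(4, 0), 'B'] = np.nan

# Rows where 'A' is missing
df_A_missing = df_missing.loc[df_missing['A'].isna()]

# Rows where at least one column is missing
df_any_missing = df_missing.loc[df_missing.isna().any(axis=1)]

print(df_A_missing)
print(df_any_missing)
```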
