The Pandas isin Method: Efficient Data Filtering in Python

2023-10-21

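The isin method in Pandas filters DataFrames and Series: it lets you select the rows in which a column (or several columns) contains specific values.

In this tutorial, we cover its syntax and parameters, basic row filtering, lookups with dictionaries and sets, handling multiple conditions, and more.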

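Table of Contents

1 Pandas isin() syntax and parameters
2 Filtering rows with a list of values using isin
3 Handling multiple conditions
4 Case-insensitive filtering with isin
5 Using isin with a dictionary for multiple columns
6 Building the isin input list dynamically
7 Using isin with another DataFrame
8 Handling null or missing values
9 Why are isin lookups so fast?
10 Chaining with loc and iloc
11 Using isin with the where method
12 isin vs. direct comparison
13 Practical example (lookup logic with isin)
14 Resource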

Pandas isin() syntax and parameters

Here is the basic syntax of the isin method when applied to a Series:


Series.isin(values)

For a DataFrame:


DataFrame.isin(values)

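Here, values is the collection of values to test against.

Parameters:

values: the values to search for. For a Series, this is a set or list-like (such as a list or another Series). For a DataFrame, it can be an iterable, a Series, a dict, or another DataFrame.

The method returns a boolean Series or DataFrame with the same shape as the original data, which you can then use as a mask to filter rows or columns from the original.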
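As a small illustration of the return value, the sketch below shows that isin yields one boolean per element and preserves the original index. The Series and its values are made up for this example.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the return value
s = pd.Series(['a', 'b', 'c', 'a'])

# One boolean per element, same length and index as s
mask = s.isin(['a', 'c'])
print(mask.tolist())

# Indexing with the mask keeps only the matching elements
print(s[mask].tolist())
```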

Filtering rows with a list of values using isin

Suppose you have a DataFrame with a "Fruit" column and want to keep the rows where the fruit is "Apple" or "Banana".


import pandas as pd
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1]
})

# Filtering rows with isin method
filtered_df = df[df['Fruit'].isin(['Apple', 'Banana'])]
print(filtered_df)

Output:


    Fruit  Price
0   Apple    1.0
1  Banana    0.5
5   Apple    1.2
6  Banana    0.6

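Using the isin method, you have kept only the rows where the fruit is "Apple" or "Banana".

Handling multiple conditions

Combining isin with other conditions through the & (and), | (or), and ~ (not) operators lets you build complex filters.

Here is how to use them to manage multiple conditions:

Using & (AND operator)

Suppose you want to keep the rows where "Fruit" is "Apple" or "Banana" AND "Price" is less than 1.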


df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1]
})
filtered_df = df[df['Fruit'].isin(['Apple', 'Banana']) & (df['Price'] < 1)]
print(filtered_df)

Output:


    Fruit  Price
1  Banana    0.5
6  Banana    0.6

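Here you get only the rows where the fruit is "Apple" or "Banana" and the price is below 1.

Using | (OR operator)

Now let's keep the rows where "Fruit" is "Apple" OR "Price" is greater than 2.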


filtered_df = df[df['Fruit'].isin(['Apple']) | (df['Price'] > 2)]
print(filtered_df)

Output:


    Fruit  Price
0   Apple    1.0
4     Fig    3.0
5   Apple    1.2
7  Cherry    2.1

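The resulting DataFrame contains the rows where the fruit is "Apple" or the price is greater than 2.

Using ~ (NOT operator)

To get the rows where "Fruit" is NOT "Apple" or "Banana":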


filtered_df = df[~df['Fruit'].isin(['Apple', 'Banana'])]
print(filtered_df)

Output:


    Fruit  Price
2  Cherry    2.0
3    Date    1.5
4     Fig    3.0
7  Cherry    2.1

With the ~ operator, you exclude every row where the fruit is "Apple" or "Banana".
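Comparisons such as df['Price'] < 1 are wrapped in parentheses in the examples above because & and | bind more tightly than comparison operators. One way to keep longer filters readable is to store each mask in a variable first. The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1]
})

# Build each boolean mask separately (names are illustrative)
is_apple_or_banana = df['Fruit'].isin(['Apple', 'Banana'])
is_cheap = df['Price'] < 1

# Apple or Banana rows that are NOT cheap
filtered_df = df[is_apple_or_banana & ~is_cheap]
print(filtered_df)
```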

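Case-insensitive filtering with isin

To make the lookup case-insensitive, convert both the filter list and the DataFrame column to lowercase (or uppercase) before applying the isin method: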


filter_fruits = ['APPLE', 'date']
filtered_df = df[df['Fruit'].str.lower().isin([fruit.lower() for fruit in filter_fruits])]
print(filtered_df)

Output:


   Fruit  Price
0  Apple    1.0
3   Date    1.5
5  Apple    1.2

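With this approach, filtered_df captures the rows for "Apple", "Date", and the second "Apple", regardless of case mismatches between the DataFrame and the filter list.

Note: for large DataFrames, converting the case of an entire column can be computationally expensive.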
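If the same filter is applied repeatedly, one option is to normalize the filter values once, outside the filtering expression. The sketch below uses str.casefold(), a more aggressive form of lowercasing intended for caseless matching. The sample data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'BANANA', 'cherry', 'Date']})
filter_fruits = ['APPLE', 'date']

# Normalize the filter values once
wanted = {fruit.casefold() for fruit in filter_fruits}

# Normalize the column and test membership
mask = df['Fruit'].str.casefold().isin(wanted)
print(df[mask])
```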

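Using isin with a dictionary for multiple columns

Given a dictionary whose keys are column names and whose values are lists of acceptable values for that column, DataFrame.isin returns a DataFrame of booleans.

Here is how to use this technique: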


df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Color': ['Red', 'Yellow', 'Red', 'Brown', 'Purple', 'Green', 'Yellow', 'Red']
})

# Using isin with a dictionary
criteria = {
    'Fruit': ['Apple', 'Banana'],
    'Color': ['Red', 'Yellow']
}
filtered_mask = df.isin(criteria)
print(filtered_mask)

Output:


   Fruit  Color
0   True   True
1   True   True
2  False   True
3  False  False
4  False  False
5   True  False
6   True   True
7  False   True

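The boolean DataFrame above shows, cell by cell, where a value satisfies the condition given in the dictionary. For example, the first row (index 0) satisfies both conditions: "Fruit" is "Apple" and "Color" is "Red".

Filtering on multi-column criteria

To filter the original DataFrame down to rows where both the "Fruit" and "Color" columns satisfy their conditions, combine the booleans: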


final_filtered_df = df[filtered_mask.all(axis=1)]
print(final_filtered_df)

Output:


    Fruit   Color
0   Apple     Red
1  Banana  Yellow
6  Banana  Yellow

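The all(axis=1) method checks whether every condition (column) is True for a given row, so the resulting DataFrame contains only the rows that satisfy the conditions in both columns.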
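One caveat worth checking: columns that are not keys in the dictionary come back as all False, so all(axis=1) over the full mask would drop every row. A possible workaround, sketched below with a hypothetical extra Price column, is to restrict the mask to the dictionary's columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry'],
    'Color': ['Red', 'Yellow', 'Red'],
    'Price': [1.0, 0.5, 2.0]   # not mentioned in criteria
})
criteria = {'Fruit': ['Apple', 'Banana'], 'Color': ['Red', 'Yellow']}

mask = df.isin(criteria)

# No criteria were given for Price, so that column never matches
print(mask['Price'].any())

# Only require the columns that appear in criteria
cols = list(criteria)
filtered = df[mask[cols].all(axis=1)]
print(filtered)
```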

Building the isin input list dynamically

You can build a filter from user input. Given the following DataFrame:


sales_df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Papaya', 'Apple', 'Banana'],
    'Units': [100, 150, 80, 90, 50, 110, 140]
})

For more interactivity, you can ask users which fruits they are interested in:


# Asking user for input
user_fruits = input("Enter the names of fruits separated by commas: ").split(",")

# Stripping potential white spaces and converting to list
user_fruits_list = [fruit.strip() for fruit in user_fruits]

# Filtering based on user input
filtered_sales = sales_df[sales_df['Fruit'].isin(user_fruits_list)]
print(filtered_sales)
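Because input() blocks until someone types, the parsing step can be easier to test as a small function. The helper below, parse_fruits, is a hypothetical name introduced for this sketch. It also drops empty entries and normalizes capitalization, which the snippet above does not do.

```python
import pandas as pd

def parse_fruits(raw):
    """Split a comma-separated string into cleaned fruit names (hypothetical helper)."""
    # Strip whitespace, skip empty pieces, and title-case to match the data
    return [part.strip().title() for part in raw.split(',') if part.strip()]

sales_df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Papaya', 'Apple', 'Banana'],
    'Units': [100, 150, 80, 90, 50, 110, 140]
})

# A fixed string stands in for what a user might type
user_fruits_list = parse_fruits(' apple, MANGO ,, ')
filtered_sales = sales_df[sales_df['Fruit'].isin(user_fruits_list)]
print(filtered_sales)
```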

Using isin with another DataFrame

You can use isin to filter one DataFrame based on the values in another DataFrame.

Let's create two DataFrames:


import pandas as pd
data = {
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig']
}
df = pd.DataFrame(data)

filter_data = {
    'Product': ['Apple', 'Date']
}
filter_df = pd.DataFrame(filter_data)

Filter the main DataFrame (df) based on the values in the filter DataFrame (filter_df):


filtered_by_df = df[df['Product'].isin(filter_df['Product'])]
print(filtered_by_df)

Output:


    ID Product
0  101   Apple
3  104    Date

If you generate filters dynamically or they change often, updating a separate filter DataFrame is easier than adjusting lists or conditions by hand.
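Note the difference between passing a column (a Series) to Series.isin and passing a whole DataFrame to DataFrame.isin. According to the pandas documentation, when values is a DataFrame, both the index and the column labels must match for a cell to count as a hit. The sketch below uses hypothetical frames to show the contrast.

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Apple', 'Banana']})
other = pd.DataFrame({'Product': ['Banana', 'Apple']})  # same values, swapped rows

# Series.isin: plain membership test, row position is irrelevant
by_column = df['Product'].isin(other['Product'])
print(by_column.tolist())

# DataFrame.isin(DataFrame): compares cells that share index and column labels
by_frame = df.isin(other)['Product']
print(by_frame.tolist())
```

For row filtering by membership, passing the column as in the example above is usually what you want.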

Handling null or missing values

When isin is given a list containing NaN or None, it can match and filter rows with missing values.


df = pd.DataFrame({
    'Fruit': ['Apple', None, 'Cherry', 'Date', 'Fig', 'Apple', None, 'Cherry']
})

# Filtering rows using isin with None
filtered_df = df[df['Fruit'].isin([None])]
print(filtered_df)

Output:


  Fruit
1  None
6  None

Combining isin with notna()

You can filter out missing values while using isin by chaining the notna() method.


fruits_list = ['Apple', None]
filtered_df = df[df['Fruit'].isin(fruits_list) & df['Fruit'].notna()]
print(filtered_df)

Output:


    Fruit
0   Apple
5   Apple

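Adding notna() ensures the result contains no missing values, even if the list contains None or NaN.

Using isna() together with isin()

You can also combine isin with isna when you want rows that hold either specific values or missing values: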


fruits_list = ['Apple', 'Cherry']
filtered_df = df[df['Fruit'].isin(fruits_list) | df['Fruit'].isna()]
print(filtered_df)

Output:


    Fruit
0   Apple
1    None
2  Cherry
5   Apple
6    None
7  Cherry

By combining isin with isna, the resulting DataFrame contains the rows where the fruit is "Apple" or "Cherry", or where the value is missing.
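In numeric columns, missing values are stored as NaN. pandas appears to treat a NaN in the values list as matching NaN in the data, even though NaN == NaN is False in plain Python. The sketch below uses made-up prices to check this against the more explicit isna().

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.0, np.nan, 2.5, np.nan])

# NaN in the lookup list is matched against NaN in the data
mask_isin = prices.isin([np.nan])

# The explicit alternative
mask_isna = prices.isna()

print(mask_isin.tolist())
print(mask_isna.tolist())
```

When the intent is simply "find the missing values", isna() states that more directly.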

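Why are isin lookups so fast?

The isin method optimizes membership checks internally by using hash-based lookups, which behave much like converting the input iterable to a set.

This matters because checking membership in a set is O(1) on average, whereas checking membership in a list is O(n) on average.

To demonstrate the performance difference between a list and a set, avoid the isin method itself, because it already uses hashing internally.

Instead, use Python's native in operator inside apply, which checks membership one element at a time.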


import pandas as pd
import time

df = pd.DataFrame({
    'Value': list(range(1, 100001))
})

# Defining a list and a set
lookup_list = list(range(50001, 150001))
lookup_set = set(lookup_list)

# Time taken using list with native in operator
start_time = time.time()
df_list = df[df['Value'].apply(lambda x: x in lookup_list)]
end_time = time.time()
print(f"Time with list: {end_time - start_time} seconds")

# Time taken using set with native in operator
start_time = time.time()
df_set = df[df['Value'].apply(lambda x: x in lookup_set)]
end_time = time.time()
print(f"Time with set: {end_time - start_time} seconds")

Output:


Time with list: 101.42362904548645 seconds
Time with set: 0.03318667411804199 seconds

The set is far faster than the list, and this kind of hash-based lookup is why isin is so fast.
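For comparison, the same filter can be written with isin directly. The sketch below uses a smaller, hypothetical range so it runs quickly, and checks that isin and the apply-based set lookup select the same rows. Timings are left out because they vary by machine.

```python
import pandas as pd

df = pd.DataFrame({'Value': list(range(1, 10001))})
lookup_set = set(range(5001, 15001))

# Element-by-element membership test with Python's in operator
df_apply = df[df['Value'].apply(lambda x: x in lookup_set)]

# Vectorized membership test with isin
df_isin = df[df['Value'].isin(lookup_set)]

print(df_apply.equals(df_isin))
print(len(df_isin))
```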

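Chaining with loc and iloc

Both loc and iloc are indexers in Pandas.

Using isin with loc

loc lets you access a group of rows and columns by label. Combined with isin, it lets you filter rows on a condition and select specific columns at the same time.

Consider an example DataFrame: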


df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry'],
    'Price': [0.5, 0.3, 1.0, 0.8],
    'Quantity': [100, 150, 50, 80]
})

To keep the rows where the fruit is "Apple" or "Mango" and select only the "Price" column:


selected_prices = df.loc[df['Fruit'].isin(['Apple', 'Mango']), 'Price']
print(selected_prices)

Output:


0    0.5
2    1.0
Name: Price, dtype: float64

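Using isin with iloc

While iloc is mainly for integer position-based indexing, you can combine a boolean mask, such as one produced by isin, with an integer column position to get the selection you want.

To get the same result as the previous example using iloc: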


row_mask = df['Fruit'].isin(['Apple', 'Mango'])
selected_prices = df.iloc[row_mask.values, 1]
print(selected_prices)

Output:


0    0.5
2    1.0
Name: Price, dtype: float64

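Keep in mind that with iloc you need to convert the boolean Series (the mask) to its underlying array with the .values attribute, because iloc does not accept a boolean Series as a mask.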
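A variation on the same idea, shown as a sketch: to_numpy() is the method the pandas documentation recommends over .values, and columns.get_loc looks up the column position so it need not be hard-coded.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry'],
    'Price': [0.5, 0.3, 1.0, 0.8],
    'Quantity': [100, 150, 50, 80]
})

row_mask = df['Fruit'].isin(['Apple', 'Mango'])

# Look up the integer position of the 'Price' column instead of hard-coding 1
price_pos = df.columns.get_loc('Price')

selected_prices = df.iloc[row_mask.to_numpy(), price_pos]
print(selected_prices)
```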

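Modifying values with isin and loc

You can also use isin with loc to modify specific entries in a DataFrame.

Suppose you want to offer a discount that lowers the price of "Apple" and "Mango" by 10%: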


df.loc[df['Fruit'].isin(['Apple', 'Mango']), 'Price'] *= 0.9
print(df)

Output:


    Fruit  Price  Quantity
0   Apple   0.45       100
1  Banana   0.30       150
2   Mango   0.90        50
3  Cherry   0.80        80

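Notice how the prices of "Apple" and "Mango" have changed.

Using isin with the where method

Combined with isin, the where method lets you express conditional selection.

Consider an example DataFrame: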


import pandas as pd
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Grape'],
    'Price': [0.5, 0.3, 1.0, 0.8, 0.6]
})

To keep the values in rows where Fruit is "Apple" or "Mango" and replace the other rows with NaN:


filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']))
print(filtered_df)

Output:


    Fruit  Price
0   Apple    0.5
1     NaN    NaN
2   Mango    1.0
3     NaN    NaN
4     NaN    NaN

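By default, where replaces non-matching rows with NaN. You can specify a different value, or even a DataFrame of replacement values.

For example, to replace non-matching fruits with "Not Selected" and their prices with 0: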


replacement_df = pd.DataFrame({
    'Fruit': ['Not Selected'] * len(df),
    'Price': [0] * len(df)
})
filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']), other=replacement_df)
print(filtered_df)

Output:


         Fruit  Price
0        Apple    0.5
1  Not Selected    0.0
2        Mango    1.0
3  Not Selected    0.0
4  Not Selected    0.0

Combining where and isin with multiple conditions

Suppose we want to keep the "Apple" or "Mango" rows, but only if their price is greater than 0.4:


filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']) & (df['Price'] > 0.4))
print(filtered_df)

Output:


   Fruit  Price
0  Apple    0.5
1    NaN    NaN
2  Mango    1.0
3    NaN    NaN
4    NaN    NaN
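When only one column needs a replacement value, where can also be called on that column with a scalar other. The sketch below reuses the fruit data and fills non-matching prices with 0.0, an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Grape'],
    'Price': [0.5, 0.3, 1.0, 0.8, 0.6]
})

mask = df['Fruit'].isin(['Apple', 'Mango'])

# Keep the price where the mask is True, otherwise use 0.0
masked_price = df['Price'].where(mask, 0.0)
print(masked_price.tolist())
```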

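isin vs. direct comparison

For a single value or a small set of values, using comparison and logical operators (==, !=, &, |) is more direct and can be more efficient.

As the number of values to check grows, chaining many direct comparisons becomes cumbersome and harder to read. This is where isin comes in: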


fruits_list = ['Apple', 'Mango', 'Grape', 'Peach', 'Strawberry', ...]  # a long list
filtered_df = df[df['Fruit'].isin(fruits_list)]

Let's run a simple speed test to see which is faster:


import pandas as pd
import numpy as np
import timeit

# Create a sample dataframe
np.random.seed(42)
n = 10**6  # Number of rows in the DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 1000, n)
})

# Some values to be checked
values_to_check = list(range(500, 510))

# isin method
def using_isin():
    return df[df['A'].isin(values_to_check)]

# Direct comparison method using a range check (valid because the values are consecutive)
def using_direct_comparison():
    return df[(500 <= df['A']) & (df['A'] <= 509)]

# Measure the performance
isin_time = timeit.timeit(using_isin, number=100)
chaining_comparison_time = timeit.timeit(using_direct_comparison, number=100)

print(f"Time using isin: {isin_time:.5f} seconds")
print(f"Time using direct comparison: {chaining_comparison_time:.5f} seconds")

Output:


Time using isin: 3.27576 seconds
Time using direct comparison: 0.60838 seconds

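Direct comparison is much faster!

Note that although direct comparison is faster in this specific case (consecutive integers), that is not always so.

With non-consecutive values or more complex filter criteria, the difference is not that large: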


import pandas as pd
import numpy as np
import timeit

# Create a sample dataframe
np.random.seed(42)
n = 10**6  # Number of rows in the DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 1000, n)
})

values_to_check = [500, 502, 505, 507, 510, 515, 520]

# isin method combined with other conditions
def using_isin():
    even_and_greater_than_700 = (df['A'] > 700) & (df['A'] % 2 == 0)
    squared_ends_in_25 = (df['A']**2) % 100 == 25
    return df[df['A'].isin(values_to_check) | even_and_greater_than_700 | squared_ends_in_25]

# Direct comparison method using == and or combined with other conditions
def using_direct_comparison():
    return df[
        (df['A'] == 500) |
        (df['A'] == 502) |
        (df['A'] == 505) |
        (df['A'] == 507) |
        (df['A'] == 510) |
        (df['A'] == 515) |
        (df['A'] == 520) |
        ((df['A'] > 700) & (df['A'] % 2 == 0)) |
        ((df['A']**2) % 100 == 25)
    ]

# Measure the performance
isin_time = timeit.timeit(using_isin, number=100)
direct_comparison_time = timeit.timeit(using_direct_comparison, number=100)

print(f"Time using isin : {isin_time:.5f} seconds")
print(f"Time using direct comparison: {direct_comparison_time:.5f} seconds")

Output:


Time using isin : 6.36738 seconds
Time using direct comparison: 6.16720 seconds

Direct comparison is slightly faster.
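For the consecutive-integer case in the first benchmark, Series.between is a compact way to write the range check. The sketch below, on a smaller random sample, only verifies that it selects the same rows as isin. It makes no claim about speed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'A': rng.integers(0, 1000, 10_000)})

# Membership in an explicit list of consecutive values
mask_isin = df['A'].isin(list(range(500, 510)))

# Equivalent range check; both endpoints are included by default
mask_between = df['A'].between(500, 509)

print(mask_isin.equals(mask_between))
```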

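Practical example (lookup logic with isin)

Suppose you are analyzing a dataset of book titles and want to classify the titles using genre lookups.

Instead of writing complex logic, you can use isin for the task.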


# Sample books DataFrame
books = pd.DataFrame({
    'Title': ['The Hobbit', 'War and Peace', 'The Great Gatsby', 'A Brief History of Time', 'Pride and Prejudice'],
})

# Genre lookup lists
fantasy_titles = ['The Hobbit', 'Harry Potter', 'Lord of the Rings']
classic_titles = ['War and Peace', 'The Great Gatsby', 'Pride and Prejudice']

# Using isin for lookup logic
books['Genre'] = 'Other'
books.loc[books['Title'].isin(fantasy_titles), 'Genre'] = 'Fantasy'
books.loc[books['Title'].isin(classic_titles), 'Genre'] = 'Classic'
print(books)

Output:


                    Title    Genre
0              The Hobbit  Fantasy
1           War and Peace  Classic
2        The Great Gatsby  Classic
3  A Brief History of Time    Other
4      Pride and Prejudice  Classic

Using isin together with loc, we classified each book by its genre.
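When there are many genres, one alternative to repeated isin calls is a single dictionary lookup with Series.map. This is a sketch of a different technique, not part of the isin API, and genre_by_title is a hypothetical name.

```python
import pandas as pd

books = pd.DataFrame({
    'Title': ['The Hobbit', 'War and Peace', 'A Brief History of Time'],
})

# Hypothetical lookup table from title to genre
genre_by_title = {
    'The Hobbit': 'Fantasy',
    'War and Peace': 'Classic',
}

# Titles missing from the dictionary map to NaN, then fall back to 'Other'
books['Genre'] = books['Title'].map(genre_by_title).fillna('Other')
print(books)
```

The isin version keeps each genre's titles in its own list, while the map version keeps everything in one dictionary. Which is easier to maintain depends on how the lookup data arrives.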

Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
