pandas: reset_index and set_index explained

2023-05-16

1. pandas.DataFrame.set_index

2. pandas.DataFrame.reset_index

1. pandas.DataFrame.set_index

Function signature:

DataFrame.set_index(self, keys, drop=True, append=False, inplace=False, verify_integrity=False)

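Purpose:

Set the DataFrame index from existing columns: one or more existing columns or arrays (of the correct length) become the DataFrame index (the row labels). The new index can either replace the existing index or extend it. If that reads as abstract, the examples below make it concrete.

Commonly used parameters: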

(See the official docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index)

keys：label or array-like or list of labels/arrays

The name(s) of the column(s) to use as the new index.

This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.

drop：bool, default True

Whether to delete the column(s) that become the new index.

Delete columns to be used as the new index.

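A few examples:

0. First, a quick look at the structure of a pandas DataFrame, shown in the table below. target1 and target2 are the targets, that is, the row labels that each row of data belongs to. label is a name for the targets (the name of the index); it is optional. attribute1 to attribute3 are the attributes, that is, the columns describing each target. The numbers 1 to 6 are the data values.

Structure of a DataFrame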
	attribute1	attribute2	attribute3
(label)
target1	1	2	3
target2	4	5	6
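
The table above can be rebuilt directly as a minimal sketch. The variable names (frame, row_labels, and so on) are chosen here for illustration and do not come from the original article.

```python
import pandas as pd

# Rebuild the small table above: two rows (targets) and three columns (attributes).
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=pd.Index(["target1", "target2"], name="label"),
    columns=["attribute1", "attribute2", "attribute3"],
)

row_labels = list(frame.index)        # the targets
index_name = frame.index.name         # the optional label (name of the index)
column_labels = list(frame.columns)   # the attributes
values = frame.to_numpy().tolist()    # the data values 1 to 6
print(frame)
```

The index holds the row labels and is separate from the data columns, which is exactly what set_index and reset_index move values into and out of.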

1. Build the data:

import pandas as pd
import numpy as np
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
print(df)
"""------------------Result----------------------"""
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

2. Set the index to the 'month' column:

df2 = df.set_index('month')
print(df2)
"""------------------Result----------------------"""
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31

3. Set the index to the 'month' column while keeping the column, using drop=False:

df5 = df.set_index('month', drop=False)
print(df5)
"""------------------Result----------------------"""
       month  year  sale
month
1          1  2012    55
4          4  2014    40
7          7  2013    84
10        10  2014    31

4. Create a MultiIndex from the 'year' and 'month' columns:

df3 = df.set_index(['year', 'month'])
print(df3)
"""------------------Result----------------------"""
            sale
year month
2012 1        55
2014 4        40
2013 7        84
2014 10       31

5. Create a MultiIndex from an Index object and a column:

df4 = df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
print(df4)
"""------------------Result----------------------"""
        month  sale
  year
1 2012      1    55
2 2014      4    40
3 2013      7    84
4 2014     10    31
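
The signature also lists append and verify_integrity, which the examples above do not exercise. The following is a small sketch of how they appear to behave on the same data. The names appended and duplicate_rejected are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# append=True keeps the existing 0..3 index and adds 'month' as a second level,
# i.e. the new index extends the old one instead of replacing it.
appended = df.set_index('month', append=True)
print(appended.index.nlevels)

# 'year' contains 2014 twice, so verify_integrity=True rejects it as an index.
try:
    df.set_index('year', verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Without verify_integrity=True, the duplicated 'year' values would be accepted silently, as in example 4 above.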

2. pandas.DataFrame.reset_index

Function signature:

DataFrame.reset_index(self, level=None, drop=False, inplace=False, col_level=0, col_fill='')

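Purpose:

Official description: reset the index, or a level of it. The method resets the index of the DataFrame and uses the default index instead. If the DataFrame has a MultiIndex, it can remove one or more levels. That is a little dense, so here is a plainer version: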

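As the name suggests, the function resets the index. More concretely: when using a random forest, for example, we randomly sample the original data to build new sample sets, and the sampled rows keep their original labels, so the index comes out jumbled and untidy. reset_index replaces those labels with a clean, consecutive 0..n-1 index; the rows themselves are not reordered. It is commonly used to give data a fresh consecutive row index after the data has been reorganized.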
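
The sampling scenario can be sketched as follows. The column names and values are made up for illustration, and DataFrame.sample with replace=True stands in for the bootstrap step of a random forest rather than reproducing any particular library's internals.

```python
import pandas as pd

df = pd.DataFrame({'feature': [10, 20, 30, 40, 50],
                   'target': [0, 1, 0, 1, 0]})

# Draw a bootstrap-style sample (with replacement); the original row labels
# come along, so the index is out of order and may contain repeats.
sample = df.sample(n=len(df), replace=True, random_state=0)
print(sample)

# reset_index(drop=True) discards those labels and numbers the rows 0..n-1.
# The row order is unchanged; only the labels are replaced.
tidy = sample.reset_index(drop=True)
print(tidy)
```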

Commonly used parameters (concrete examples follow below):

level：int, str, tuple, or list, default None

Only remove the given level(s) from the index. By default all levels are removed.

drop：bool, default False

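Whether to discard the old index rather than insert it as a column. With drop=True the index is simply reset to the default integer index.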

A few examples (see the official docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html):

1. First create a DataFrame whose index labels are all strings:

import pandas as pd
import numpy as np
df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],  # string row labels
                  columns=('class', 'max_speed'))
print(df)
"""---------------Output---------------"""
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN

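2. When we reset the index with the default drop=False, the old index is added as a column and a new consecutive index is used. Compare the result with the original data: the old index has become a new column named index, and a new index 0 to 3 has been added.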

df1 = df.reset_index()
print(df1)
"""-------------Output-------------"""
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

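3. Setting drop=True prevents the old index from being added as a column: the old index is discarded before the new integer index is assigned. As the result shows, drop=True removes the original index outright and adds a new sequential one.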

df2 = df.reset_index(drop=True)
print(df2)
"""--------------Output----------------"""
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

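4. Now the MultiIndex case. We build a new DataFrame with a MultiIndex whose levels are named 'class' and 'name'; note that these index level names are not column names. The columns are speed_max and species_type. As the output shows, the multi-level layout is less tidy to read.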

index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
                                   ('bird', 'parrot'),
                                   ('mammal', 'lion'),
                                   ('mammal', 'monkey')],
                                  names=['class', 'name'])
columns = pd.MultiIndex.from_tuples([('speed_max',), ('species_type',)])
df3 = pd.DataFrame([(389.0, 'fly'),
                   ( 24.0, 'fly'),
                   ( 80.5, 'run'),
                   (np.nan, 'jump')],
                  index=index,
                  columns=columns)
print(df3)
"""-------------------Output---------------------"""
              speed_max species_type
class  name
bird   falcon     389.0          fly
       parrot      24.0          fly
mammal lion        80.5          run
       monkey       NaN         jump

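5. Next, apply reset_index and look at the effect. The index level names become column names, and the index values are kept as new columns. The result is a flat table that is easier to read than the original.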

df4 = df3.reset_index()
print(df4)
"""-----------------Output---------------------"""
    class    name speed_max species_type
0    bird  falcon     389.0          fly
1    bird  parrot      24.0          fly
2  mammal    lion      80.5          run
3  mammal  monkey       NaN         jump

6. If the index has multiple levels, as in df3, we can reset just a subset of them. For example, resetting the 'class' level of df3 turns the index level named 'class' into a data column named 'class'.

df5 = df3.reset_index(level='class')
print(df5)
"""----------------Output----------------"""
         class speed_max species_type
name
falcon    bird     389.0          fly
parrot    bird      24.0          fly
lion    mammal      80.5          run
monkey  mammal       NaN         jump
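
Since level accepts an int, str, tuple, or list, the same level can presumably be selected in more than one way. The sketch below uses a simplified frame with a single plain column (speeds, a name introduced here) instead of df3, to keep the comparison easy to follow.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('bird', 'falcon'), ('bird', 'parrot'),
                                 ('mammal', 'lion'), ('mammal', 'monkey')],
                                names=['class', 'name'])
speeds = pd.DataFrame({'max_speed': [389.0, 24.0, 80.5, float('nan')]}, index=idx)

by_name = speeds.reset_index(level='class')          # select the level by name
by_position = speeds.reset_index(level=0)            # level 0 is 'class'
both = speeds.reset_index(level=['class', 'name'])   # a list resets several levels

print(by_name.equals(by_position))
print(both)
```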

7. Applying drop=True to df5 gives the result below. Only one index level is dropped, the one named 'name', because in df5 the level originally named 'class' has already been turned into the 'class' column.

df6 = df5.reset_index(drop=True)
print(df6)
"""----------------Output----------------"""
    class speed_max species_type
0    bird     389.0          fly
1    bird      24.0          fly
2  mammal      80.5          run
3  mammal       NaN         jump
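
To tie the two methods together, here is a small round-trip sketch on the sales data from the first section. The names indexed and restored are illustrative. It also shows that when the index has a name, reset_index uses that name for the new column instead of the generic index.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

indexed = df.set_index('month')     # 'month' moves from the columns to the index
restored = indexed.reset_index()    # and back; the column is named 'month'

print(list(restored.columns))
print(restored.equals(df))
```

The round trip restores the original frame here because 'month' was the first column; reset_index always inserts the former index at the front of the columns.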
