shift, diff, applymap and apply, pd.cut and pd.qcut, groupby + agg, sort_values, merge and concat, value_counts, any and all
rename, drop, duplicated, drop_duplicates, dropna
Rough note: some of these only work on 1-D arrays and Series, while others also take 2-D arrays and DataFrames.
Use common sense to guess which is which, or just try.
Also watch whether a call starts with pd. or with df.
shift()
df.shift(periods=1, freq=None, axis=0)
Shifts the rows or columns of a DataFrame.
Positive periods shifts down (axis=0) or right (axis=1).
import pandas as pd
df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
'Col2': [13, 23, 18, 33, 48],
'Col3': [17, 27, 22, 37, 52]})
print(df.shift())                   # shift rows down by 1
print(df.shift(-2))                 # shift rows up by 2
print(df.shift(1, axis='columns'))  # shift columns right by 1
# Result
Col1 Col2 Col3
0 NaN NaN NaN
1 10.0 13.0 17.0
2 20.0 23.0 27.0
3 15.0 18.0 22.0
4 30.0 33.0 37.0
Col1 Col2 Col3
0 15.0 18.0 22.0
1 30.0 33.0 37.0
2 45.0 48.0 52.0
3 NaN NaN NaN
4 NaN NaN NaN
Col1 Col2 Col3
0 NaN 10.0 13.0
1 NaN 20.0 23.0
2 NaN 15.0 18.0
3 NaN 30.0 33.0
4 NaN 45.0 48.0
diff()
df.diff(periods=1, axis=0)
Computes df - df.shift(), i.e. the change from the previous row (or column).
df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
'Col2': [13, 23, 18, 33, 48],
'Col3': [17, 27, 22, 37, 52]})
print(df.diff())
# Result
Col1 Col2 Col3
0 NaN NaN NaN
1 10.0 10.0 10.0
2 -5.0 -5.0 -5.0
3 15.0 15.0 15.0
4 15.0 15.0 15.0
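The identity just stated, diff() == df - df.shift(), can be checked directly on the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
                   'Col2': [13, 23, 18, 33, 48],
                   'Col3': [17, 27, 22, 37, 52]})
manual = df - df.shift()  # subtract the down-shifted frame element-wise
auto = df.diff()          # built-in first difference
print(manual.equals(auto))  # True (NaNs in the same positions count as equal)
```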
applymap() and apply()
Used to call functions that built-in methods cannot cover, e.g. when a function only works on a Series.
df.applymap(function) calls a (typically user-written) function on every single element.
df.apply(function, axis=0) passes each column (axis=0) or each row (axis=1) to the function as a Series. Example task: change every value in the top 20% of its column to 'A'.
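A minimal sketch of the "top 20% to 'A'" task above; mark_top_20 is a hypothetical helper, and reading "top 20%" as "at or above the 0.8 quantile" is one possible interpretation:

```python
import pandas as pd

# Hypothetical helper: replace values at or above a column's 0.8 quantile with 'A'
def mark_top_20(col):
    threshold = col.quantile(0.8)
    return col.where(col < threshold, 'A')

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [10, 20, 30, 40, 50]})
# apply() hands each column to the function as a Series (axis=0 is the default)
result = df.apply(mark_top_20, axis=0)
print(result)
```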
pd.cut(): bins by the values themselves
pd.cut(series/1d array, bins, right=True, labels=None)
Passing an explicit sequence of bin edges is clearer than an integer bin count.
In [22]: pd.cut(np.array([1.2,2.2,3.2,4.2,5.2]), [1,3])
Out[22]:
[(1.0, 3.0], (1.0, 3.0], NaN, NaN, NaN]
Categories (1, interval[int64]): [(1, 3]]
In [24]: pd.cut(np.array([1.2,2.2,3.2,4.2,5.2]), [1,2,3], right = False)
Out[24]:
[[1.0, 2.0), [2.0, 3.0), NaN, NaN, NaN]
Categories (2, interval[int64]): [[1, 2) < [2, 3)]
In [25]: pd.cut(np.array([1.2,2.2,3.2,4.2,5.2]),2)
Out[25]:
[(1.196, 3.2], (1.196, 3.2], (1.196, 3.2], (3.2, 5.2], (3.2, 5.2]]
Categories (2, interval[float64]): [(1.196, 3.2] < (3.2, 5.2]]
In [27]: pd.cut(np.array([1.2,2.2,3.2,4.2,5.2]), [1,2,3], labels = ['1-2','2-3'])
Out[27]:
[1-2, 2-3, NaN, NaN, NaN]
Categories (2, object): [1-2 < 2-3]
pd.qcut(): bins by quantile
Splits a set of values into quantile-based intervals.
pd.qcut(series/1d array, quantile edges, labels=[]); the number of labels must equal the number of intervals. Note the pd. prefix.
Works on 1-D arrays and Series.
It does not take a DataFrame directly, but apply() bridges that.
data = pd.Series([0,8,1,5,3,7,2,6,10,4,9])
print(pd.qcut(data,[0, 0.1, 0.2, 0.3, 1],labels=['first 10%','second 10%','third 10%','70%']))
# Result
# With 11 values an exact 1:1:1:7 split is impossible, so both 0 and 1 land in the 'first 10%' bin.
0 first 10%
1 70%
2 first 10%
3 70%
4 third 10%
5 70%
6 second 10%
7 70%
8 70%
9 70%
10 70%
dtype: category
Categories (4, object): [first 10% < second 10% < third 10% < 70%]
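As noted above, qcut does not take a DataFrame directly, but apply() can run it column by column (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
# Each column goes to qcut as a Series; 2 quantile bins split at the median
halves = df.apply(lambda col: pd.qcut(col, 2, labels=['low', 'high']))
print(halves)
```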
groupby
df.groupby(by=None, axis=0, as_index=True) groups rows by the column(s) named in by.
as_index controls whether the grouping column becomes the index.
# Build the df
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
'value': values,
'even': values % 2 == 0,
'above_three': values > 3
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Define the function
def standardize(xs):
return (xs - xs.mean()) / xs.std()
# Group
# After grouping by 'even', select the 'value' column and apply standardize within each group
grouped_data = example_df.groupby('even')
print(grouped_data['value'].apply(standardize))
# Result
a -0.577350
b 1.154701
c -1.224745
d 0.000000
e -0.577350
f 1.224745
g 0.000000
Name: value, dtype: float64
agg aggregation
After a groupby, agg can run several functions at once: df.groupby('Country').agg(['min','max'])
df.groupby('Country').agg({'Income':'min'})
df.groupby('Country').agg({'Income':'min' , 'Age':['min','mean']})
In [17]: df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
...: 'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
...: 'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
In [18]: df
Out[18]:
Country Income Age
0 China 10000 5000
1 China 10000 4321
2 India 5000 1234
3 India 5002 4010
4 America 40000 250
5 Japan 50000 250
6 China 8000 4500
7 India 5000 4321
In [19]: df.groupby('Country').agg(['min','max'])
Out[19]:
Income Age
min max min max
Country
America 40000 40000 250 250
China 8000 10000 4321 5000
India 5000 5002 1234 4321
Japan 50000 50000 250 250
In [20]: df.groupby('Country').agg({'Income':'min'})
Out[20]:
Income
Country
America 40000
China 8000
India 5000
Japan 50000
In [21]: df.groupby('Country').agg({'Income':'min' , 'Age':['min','mean']})
Out[21]:
Income Age
min min mean
Country
America 40000 250 250.000000
China 8000 4321 4607.000000
India 5000 1234 3188.333333
Japan 50000 250 250.000000
sort_values
Sorts by value.
df.sort_values(by=column/row label(s), axis=0, ascending=True, inplace=False, na_position={'first','last'}) sorts by the labels given in by.
ascending=False sorts in descending order.
inplace=True replaces the original data.
na_position sets where missing values end up.
df=pd.DataFrame({'col1':['A','A','B',np.nan,'D','C'],
'col2':[2,1,9,8,7,7],
'col3':[0,1,9,4,2,8]
})
print(df.sort_values(by=['col1','col2'], na_position='first'))
# Result
col1 col2 col3
3 NaN 8 4
1 A 1 1
0 A 2 0
2 B 9 9
5 C 7 8
4 D 7 2
merge()
pd.merge(left df, right df, how='inner', on, left_on, right_on, suffixes, ...)
or df1.merge(df2, ...)
In an interactive session the merged table displays even without print.
how='inner' inner join (key intersection), 'outer' outer join (key union), 'left' left join, 'right' right join
on
# how='inner' is equivalent here to on='key': merge on the shared column key.
df1 = pd.DataFrame({'key':list('bbaca'), 'data1':range(8)})
df2 = pd.DataFrame({'key':['a','b','d'], 'data2':range(3)})
print(pd.merge(df1,df2))
# 等价
print(pd.merge(df1, df2, on='key'))
# Result
data1 key data2
0 0 b 1
1 1 b 1
2 2 a 0
3 4 a 0
# how='outer': keep keys from both sides, fill gaps with NaN
print(pd.merge(df1,df2, how='outer'))
# Result
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 2.0 a 0.0
3 4.0 a 0.0
4 3.0 c NaN
5 NaN d 2.0
# how='left': keep every row of the left df
print(pd.merge(df1,df2, how='left'))
# Result
data1 key data2
0 0 b 1.0
1 1 b 1.0
2 2 a 0.0
3 3 c NaN
4 4 a 0.0
# how='right': keep every row of the right df
print(pd.merge(df1,df2, how='right'))
# Result
data1 key data2
0 0.0 b 1
1 1.0 b 1
2 2.0 a 0
3 4.0 a 0
4 NaN d 2
# how and on together
subway_df.merge(weather_df, on=['DATEn','hour','latitude','longitude'], how='inner')
left_on, right_on, suffixes
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
print(pd.merge(df1,df2, left_on='lkey', right_on='rkey', suffixes=('_1','_2')))
# Result
lkey value_1 rkey value_2
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
concat()
pd.concat([df1, df2, ...], axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False); join = 'inner'/'outer'
axis=0 stacks rows; axis=1 joins columns.
In [2]: a = pd.DataFrame([[1,2,3],
...: [4,5,6],
...: [7,8,9]],columns=['a','b','c'])
...: b = pd.DataFrame([[11,23,45],
...: [22,23,24],
...: [31,32,33]],columns=['a','b','c'])
In [3]: pd.concat([a,b])
Out[3]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
0 11 23 45
1 22 23 24
2 31 32 33
In [4]: pd.concat([a,b],axis=1,join = 'inner')
Out[4]:
a b c a b c
0 1 2 3 11 23 45
1 4 5 6 22 23 24
2 7 8 9 31 32 33
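The row-wise concat above keeps the duplicated labels 0,1,2,0,1,2; the ignore_index parameter from the signature renumbers them instead (a quick sketch):

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
b = pd.DataFrame([[4, 5, 6]], columns=['a', 'b', 'c'])
# ignore_index=True throws away the old row labels and renumbers from 0
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.index.tolist())  # [0, 1]
```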
value_counts()
Counts how often each value appears: df['item_name'].value_counts()
Ascending order: df['item_name'].value_counts(ascending=True)
Proportions instead of counts: df['item_name'].value_counts(normalize=True)
Missing values are excluded by default.
df.apply(pd.value_counts) runs it on every column of a DataFrame.
In [24]: df = pd.DataFrame({'区域1' : ['西安', '太原', '西安', '太原', '郑州', '太原'],
...: '区域2' : ['太原', '太原', '西安', '西安', '西安', '太原']})
...: print(df.apply(pd.value_counts))
区域1 区域2
太原 3 3.0
西安 2 3.0
郑州 1 NaN
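The ascending and normalize options mentioned above, in one small example with made-up data:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a', 'c', None])
counts = s.value_counts()               # NaN dropped by default: a=3, b=1, c=1
asc = s.value_counts(ascending=True)    # least frequent first
props = s.value_counts(normalize=True)  # proportions of non-missing values
print(counts)
print(props)
```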
any
any(iterable): iterable is any iterable object; if every value in it is 0, '' or False, the result is False.
If even one element is not 0, '' or False, the result is True.
any(df.col.isnull()) returns True when column col contains missing values.
all
Returns True only when every element is non-zero, non-empty and not False.
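A quick check of the isnull idiom above (column names are made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': [1, np.nan, 3], 'other': [4, 5, 6]})
# Built-in any(): True because 'col' contains a NaN
has_missing = any(df['col'].isnull())
print(has_missing)                  # True
# The pandas methods do the same without the built-in wrapper
print(df['other'].isnull().any())   # False
print(df['other'].notnull().all())  # True
```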
rename
Column labels cannot be modified item by item: df.columns[6] = 'new_name' raises an error.
But the whole list can be replaced: df.columns = ['a','b'].
Correct (verbose) way:
new_index = list(df.columns)
new_index[6] = 'new_name'
df.columns = new_index
(Simple way) the rename method: df.rename(columns={'a':'aa'}, inplace=True)
It can modify the index as well as the column names.
Add inplace=True to modify the original df in place.
In [11]: df = pd.DataFrame([[1,2,3],[4,5,6]],columns = {'a','b','c'},index = {'d','e'})  # note: sets are unordered, hence the arbitrary column/row order below
In [12]: df
Out[12]:
a c b
e 1 2 3
d 4 5 6
In [13]: df.rename(columns = {'a':'aa'},inplace = True)
In [14]: df
Out[14]:
aa c b
e 1 2 3
d 4 5 6
In [16]: df.rename(index = {'e':'ee'},inplace = True)
In [17]: df
Out[17]:
aa c b
ee 1 2 3
d 4 5 6
drop
df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
labels names what to delete (or use the index/columns keywords directly);
axis chooses between row labels and column labels.
del df['A'] deletes from df immediately; drop needs inplace=True to do the same.
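A minimal sketch of drop versus del (labels made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]}, index=['x', 'y'])
no_b = df.drop('B', axis=1)  # returns a new frame; df itself is unchanged
no_x = df.drop('x')          # axis=0 is the default, so this drops a row
del df['A']                  # deletes the column from df immediately
print(no_b.columns.tolist())  # ['A', 'C']
print(no_x.index.tolist())    # ['y']
print(df.columns.tolist())    # ['B', 'C']
```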
duplicated
Flags duplicate rows. The subset parameter restricts which columns are compared (default: all).
keep parameter:
'first' leaves the first occurrence unflagged and marks later duplicates True
'last' leaves the last occurrence unflagged and marks earlier duplicates True
False marks every duplicated row as True
In [12]: df= pd.DataFrame({'k1': [ 's1']* 3 + ['s2']* 5,'k2' : [1, 1, 2, 3, 3, 4, 4,4]})
In [13]: df
Out[13]:
k1 k2
0 s1 1
1 s1 1
2 s1 2
3 s2 3
4 s2 3
5 s2 4
6 s2 4
7 s2 4
In [14]: df.duplicated()
Out[14]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
7 True
dtype: bool
In [15]: df.duplicated(keep = 'first')
Out[15]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
7 True
dtype: bool
In [16]: df.duplicated(keep = 'last')
Out[16]:
0 True
1 False
2 False
3 True
4 False
5 True
6 True
7 False
dtype: bool
In [17]: df.duplicated(keep = False)
Out[17]:
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
keep='first' (the default) removes duplicates, keeping the first occurrence.
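A small sketch of drop_duplicates with the default keep='first' and with subset, on data in the same style as the duplicated() example:

```python
import pandas as pd

df = pd.DataFrame({'k1': ['s1'] * 3 + ['s2'] * 2,
                   'k2': [1, 1, 2, 3, 3]})
deduped = df.drop_duplicates()             # full-row duplicates removed, first kept
by_k1 = df.drop_duplicates(subset=['k1'])  # compare only column k1
print(deduped)
print(by_k1)
```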
dropna
df.dropna(axis=0, how='any', inplace=False)
how parameter: 'any' drops a row with any missing value; 'all' drops it only when every value is missing.
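A quick sketch contrasting how='any' (the default) with how='all':

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3, np.nan],
                   'b': [4, 5, 6, np.nan]})
any_dropped = df.dropna()           # drop rows containing any NaN
all_dropped = df.dropna(how='all')  # drop rows only when every value is NaN
print(len(any_dropped))  # 2
print(len(all_dropped))  # 3
```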