Python for Data Analysis: Data Cleaning and Preparation (Beginner's Notes)

2023-11-13

Data Cleaning and Preparation

Handling Missing Data

import pandas as pd
import numpy as np

string_data=pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

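In pandas, missing data is referred to as NA, which stands for not available. In statistical applications, NA data may be either data that does not exist or data that exists but was not observed (for example, because of problems during data collection).
The built-in Python None value is also treated as NA in object arrays: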

string_data[0]=None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

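Functions for handling missing data:
dropna: filters axis labels based on whether the values for each label have missing data, with an adjustable threshold for how much missing data to tolerate
fillna: fills in missing data with a specified value or a fill method (ffill or bfill)
isnull: returns an object of boolean values indicating which values are missing (NA); the result has the same type and shape as the original
notnull: the negation of isnull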
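A minimal sketch of how these four methods fit together: count the missing values first, then decide whether to drop or fill. The frame `df_demo` and its column names are invented for illustration and are not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names are made up for this sketch.
df_demo = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, np.nan],
    "city": ["Oslo", None, "Lima", "Rome"],
})

# isnull() marks missing entries; summing the booleans counts them per column.
missing_per_column = df_demo.isnull().sum()

# notnull() is the complement, so the two counts add up to the row count.
present_per_column = df_demo.notnull().sum()

# dropna() keeps only the complete rows; fillna() keeps every row instead.
complete_rows = df_demo.dropna()
filled = df_demo.fillna({"height": df_demo["height"].mean(), "city": "unknown"})

print(missing_per_column)
print(complete_rows)
print(filled)
```

Checking the counts first is a cheap way to see whether dropping rows would discard most of the data.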

from numpy import nan as NA
data=pd.Series([1,NA,3.5,NA,7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

# Equivalent to
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, dropna by default drops any row containing a missing value:

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
cleaned=data.dropna()

data

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

cleaned

	0	1	2
0	1.0	6.5	3.0

# how='all' drops only the rows that are entirely NA
cleaned_how=data.dropna(how='all')
cleaned_how

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
3	NaN	6.5	3.0

# To drop columns the same way, pass axis=1
data[4]=NA
data

	0	1	2	4
0	1.0	6.5	3.0	NaN
1	1.0	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	NaN	6.5	3.0	NaN

data.dropna(how='all',axis=1)

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

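Another way of filtering out DataFrame rows tends to come up with time series data. Suppose you want to keep only the rows that contain a certain number of observations. The thresh argument does this:

thresh keeps the rows (or columns) that have at least n non-NaN values.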

df=pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1]=NA
df.iloc[:2,2]=NA
df

	0	1	2
0	1.219978	NaN	NaN
1	0.341182	NaN	NaN
2	0.782306	NaN	0.402269
3	0.033353	NaN	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

df.dropna()

	0	1	2
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

df.dropna(thresh=2)

	0	1	2
2	0.782306	NaN	0.402269
3	0.033353	NaN	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544
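Because the frame above is random, here is a small deterministic sketch of the same idea. The frame `obs` is an assumption made for illustration; it shows that thresh counts the non-missing values in each row.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result does not depend on random numbers.
obs = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, np.nan],
    "b": [np.nan, 5.0, 6.0, np.nan],
    "c": [np.nan, 8.0, 9.0, np.nan],
})

# Number of non-missing values in each row: 1, 3, 2, 0.
non_missing = obs.notnull().sum(axis=1)

# thresh=2 keeps the rows with at least two non-missing values.
kept = obs.dropna(thresh=2)

print(non_missing.tolist())
print(kept)
```

Rows 1 and 2 survive, which matches filtering on `non_missing >= 2` by hand.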

Filling In Missing Data with fillna()

df.fillna(0)

	0	1	2
0	1.219978	0.000000	0.000000
1	0.341182	0.000000	0.000000
2	0.782306	0.000000	0.402269
3	0.033353	0.000000	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

Calling fillna with a dict lets you use a different fill value for each column:

df.fillna({1:0.5,2:0})

	0	1	2
0	1.219978	0.500000	0.000000
1	0.341182	0.500000	0.000000
2	0.782306	0.500000	0.402269
3	0.033353	0.500000	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

_=df.copy()
_

	0	1	2
0	1.219978	NaN	NaN
1	0.341182	NaN	NaN
2	0.782306	NaN	0.402269
3	0.033353	NaN	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

_.fillna(0)

	0	1	2
0	1.219978	0.000000	0.000000
1	0.341182	0.000000	0.000000
2	0.782306	0.000000	0.402269
3	0.033353	0.000000	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

_  # the original is unchanged, because fillna returned a new object

	0	1	2
0	1.219978	NaN	NaN
1	0.341182	NaN	NaN
2	0.782306	NaN	0.402269
3	0.033353	NaN	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

# fillna returns a new object by default, but you can also modify the existing object in place:
_.fillna(0,inplace=True)
_

	0	1	2
0	1.219978	0.000000	0.000000
1	0.341182	0.000000	0.000000
2	0.782306	0.000000	0.402269
3	0.033353	0.000000	0.666443
4	-0.761581	-1.232945	-0.291452
5	-0.516256	-0.442507	0.850908
6	1.827264	0.286749	0.924544

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

	0	1	2
0	-1.029961	-0.41851	0.634309
1	-0.621635	-0.24739	0.783342
2	-1.659875	NaN	-0.231234
3	0.513173	NaN	-1.094123
4	1.787183	NaN	NaN
5	-0.611099	NaN	NaN

df.ffill()  # forward fill; fillna(method='ffill') is deprecated since pandas 2.1

	0	1	2
0	-1.029961	-0.41851	0.634309
1	-0.621635	-0.24739	0.783342
2	-1.659875	-0.24739	-0.231234
3	0.513173	-0.24739	-1.094123
4	1.787183	-0.24739	-1.094123
5	-0.611099	-0.24739	-1.094123

df.fillna(0,limit=2)  # limit caps how many NaN values are filled in each column

	0	1	2
0	-1.029961	-0.41851	0.634309
1	-0.621635	-0.24739	0.783342
2	-1.659875	0.00000	-0.231234
3	0.513173	0.00000	-1.094123
4	1.787183	NaN	0.000000
5	-0.611099	NaN	0.000000

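fillna arguments

value: scalar value or dict-like object used to fill missing values
method: fill method, 'ffill' or 'bfill'; the default is None, and recent pandas versions deprecate it in favor of the ffill() and bfill() methods
axis: axis to fill on; default axis=0
inplace: modify the calling object without producing a copy
limit: for forward and backward filling, the maximum number of consecutive values to fill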
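The arguments above can be sketched on a tiny Series. This is an illustrative example, not part of the original notes; the Series `s` is made up, and the fill methods are called as `ffill()` and `bfill()`, which current pandas provides directly.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# value: replace every NaN with a statistic of the observed data.
mean_filled = s.fillna(s.mean())  # the mean of 1.0 and 5.0 is 3.0

# Forward fill, but carry each value across at most two consecutive gaps.
ffilled = s.ffill(limit=2)

# Backward fill pulls the next valid value upward instead.
bfilled = s.bfill()

print(mean_filled.tolist())
print(ffilled.tolist())
print(bfilled.tolist())
```

With limit=2 the third gap stays NaN, which is useful when long runs of missing data should not be papered over.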

Data Transformation

Removing Duplicates

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4
6	two	4

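The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (that is, whether it has already appeared in an earlier row):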

data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

data.drop_duplicates()

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4

data['v1']=range(7)
data

	k1	k2	v1
0	one	1	0
1	two	1	1
2	one	2	2
3	two	3	3
4	one	3	4
5	two	4	5
6	two	4	6

# Detect duplicates based on a subset of the columns
data.drop_duplicates(['k1'])

	k1	k2	v1
0	one	1	0
1	two	1	1

duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead:

data.drop_duplicates(['k1','k2'],keep='last')

	k1	k2	v1
0	one	1	0
1	two	1	1
2	one	2	2
3	two	3	3
4	one	3	4
6	two	4	6
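The keep argument also accepts False, which flags every member of a duplicated group. The sketch below compares the three settings on a made-up frame `pairs` (an assumption for illustration).

```python
import pandas as pd

pairs = pd.DataFrame({"k1": ["one", "two", "one", "two", "two"],
                      "k2": [1, 1, 2, 4, 4]})

# Default (keep='first'): only later repeats are flagged.
first_flags = pairs.duplicated()

# keep='last': earlier repeats are flagged and the last occurrence is kept.
last_flags = pairs.duplicated(keep="last")

# keep=False: every member of a duplicated group is flagged.
all_flags = pairs.duplicated(keep=False)

# Rows whose (k1, k2) combination occurs exactly once.
unique_only = pairs.drop_duplicates(keep=False)

print(first_flags.tolist())
print(last_flags.tolist())
print(all_flags.tolist())
print(unique_only)
```

keep=False is handy for inspecting all conflicting rows before deciding which one to retain.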

Transforming Data Using a Function or Mapping

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham','nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3,5, 6]})
data

	food	ounces
0	bacon	4.0
1	pulled pork	3.0
2	bacon	12.0
3	Pastrami	6.0
4	corned beef	7.5
5	Bacon	8.0
6	pastrami	3.0
7	honey ham	5.0
8	nova lox	6.0

meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased=data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal']=lowercased.map(meat_to_animal)
data

	food	ounces	animal
0	bacon	4.0	pig
1	pulled pork	3.0	pig
2	bacon	12.0	pig
3	Pastrami	6.0	cow
4	corned beef	7.5	cow
5	Bacon	8.0	pig
6	pastrami	3.0	cow
7	honey ham	5.0	pig
8	nova lox	6.0	salmon

data['food'].map(lambda x:meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
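The two forms behave differently when a value has no entry in the mapping. The sketch below uses a deliberately incomplete, hypothetical mapping to show this: passing the dict to map yields NaN, while indexing the dict inside a lambda would raise KeyError.

```python
import pandas as pd

# Hypothetical subset of the mapping: 'tofu' is deliberately absent.
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["Bacon", "pastrami", "tofu"])

# Passing the dict to map: unmapped keys become NaN instead of failing.
mapped = food.str.lower().map(meat_to_animal)

# meat_to_animal[x.lower()] would raise KeyError for 'tofu';
# dict.get with a default is one way to keep the function form safe.
mapped_default = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(mapped.tolist())
print(mapped_default.tolist())
```

Which behavior is preferable depends on whether an unmapped value should be treated as missing or as an error.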

Replacing Values with replace

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

data.replace([-999,-1000],[np.nan,0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

# A dict can be passed as well
data.replace({-999:np.nan,-1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.index.map(lambda x:x[:4].upper())

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

data.index =data.index.map(lambda x:x[:4].upper())
data

	one	two	three	four
OHIO	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

data.rename(index=str.upper,columns=str.upper)

	ONE	TWO	THREE	FOUR
OHIO	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

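# Notably, rename can be used with a dict-like object to update a subset of the axis labels.
# The index labels are upper-case by now, so the key must be 'OHIO'.
data.rename(index={'OHIO':'INDIANA'},
           columns={'three':'peekaboo'})

	one	two	peekaboo	four
INDIANA	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11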

# To modify a dataset in place, pass inplace=True (again using the current label 'OHIO' as the key):
data.rename(index={'OHIO':'indiana'},inplace=True)
data

	one	two	three	four
indiana	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

Discretization and Binning

Continuous data is often discretized or otherwise separated into "bins" for analysis. The pandas cut function does this:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats=pd.cut(ages,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

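# pd.value_counts(cats) gives the bin counts for the result of pandas.cut
# (deprecated since pandas 2.1; pd.Series(cats).value_counts() is the current spelling)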
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

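Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while a square bracket means that it is closed (inclusive). You can change which side is closed by passing right=False: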

pd.cut(ages,[18,26,36,61,100],right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
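The effect of the closed side is easiest to see for values that sit exactly on a bin edge. The following sketch uses made-up edges and values; the include_lowest option shown at the end is a pandas.cut parameter that the notes above do not cover.

```python
import pandas as pd

edges = [18, 25, 35]
ages_demo = [18, 25, 35]

# Default right=True: the intervals are (18, 25] and (25, 35].
# 18 falls outside the first interval, so its code is -1 (missing).
right_closed = pd.cut(ages_demo, edges)

# right=False: the intervals are [18, 25) and [25, 35), so 35 is left out.
left_closed = pd.cut(ages_demo, edges, right=False)

# include_lowest=True also closes the left edge of the first interval.
with_lowest = pd.cut(ages_demo, edges, include_lowest=True)

print(right_closed.codes)
print(left_closed.codes)
print(with_lowest.codes)
```

A code of -1 means the value landed in no bin and will show up as NaN in the result.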

You can also set your own bin names by passing a list or array to the labels option:

group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages,bins,labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

data=np.random.rand(20)
data

array([0.50910844, 0.01886219, 0.95908375, 0.72900936, 0.88044385,
       0.94608156, 0.13493984, 0.91195245, 0.46857512, 0.38525391,
       0.02991488, 0.31362695, 0.15493992, 0.74873532, 0.6170826 ,
       0.84356457, 0.09466064, 0.01974264, 0.97598584, 0.43164735])

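If you pass cut an integer number of bins instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data:

pd.cut(data,4,precision=2)  # precision=2 limits the decimal precision of the bin labels to two digits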

[(0.5, 0.74], (0.018, 0.26], (0.74, 0.98], (0.5, 0.74], (0.74, 0.98], ..., (0.74, 0.98], (0.018, 0.26], (0.018, 0.26], (0.74, 0.98], (0.26, 0.5]]
Length: 20
Categories (4, interval[float64, right]): [(0.018, 0.26] < (0.26, 0.5] < (0.5, 0.74] < (0.74, 0.98]]

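A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, cut will usually not give each bin the same number of data points. Because qcut uses sample quantiles instead, it produces roughly equal-sized bins: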

data=np.random.randn(1000)
cats=pd.qcut(data,4)
cats

[(-0.601, -0.0125], (-2.885, -0.601], (-0.0125, 0.673], (-0.0125, 0.673], (-2.885, -0.601], ..., (-2.885, -0.601], (-2.885, -0.601], (-0.0125, 0.673], (-0.0125, 0.673], (0.673, 3.875]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.885, -0.601] < (-0.601, -0.0125] < (-0.0125, 0.673] < (0.673, 3.875]]

pd.value_counts(cats)

(-2.885, -0.601]     250
(-0.601, -0.0125]    250
(-0.0125, 0.673]     250
(0.673, 3.875]       250
Name: count, dtype: int64

Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.22, -0.0125], (-2.885, -1.22], (-0.0125, 1.303], (-0.0125, 1.303], (-1.22, -0.0125], ..., (-1.22, -0.0125], (-2.885, -1.22], (-0.0125, 1.303], (-0.0125, 1.303], (-0.0125, 1.303]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.885, -1.22] < (-1.22, -0.0125] < (-0.0125, 1.303] < (1.303, 3.875]]
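To make the contrast between cut and qcut concrete without random data, the sketch below bins a small skewed sample both ways. The list `skewed` is an assumption chosen so that a single large value stretches the range.

```python
import pandas as pd

# Skewed sample: one large value stretches the range.
skewed = [1, 2, 3, 4, 5, 6, 7, 100]

# cut splits the value range into two equal-width bins.
width_counts = pd.Series(pd.cut(skewed, 2)).value_counts().sort_index()

# qcut splits at the sample median, giving equal-sized bins.
quantile_counts = pd.Series(pd.qcut(skewed, 2)).value_counts().sort_index()

print(width_counts.tolist())
print(quantile_counts.tolist())
```

Equal-width bins put seven of the eight points in the first bin, while the quantile-based bins hold four points each.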

Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations.

data=pd.DataFrame(np.random.randn(1000,4))
data.describe()

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	-0.001687	-0.036570	0.049180	0.010509
std	1.007410	0.971646	1.013227	0.982367
min	-3.127882	-2.643439	-2.949846	-2.962251
25%	-0.678444	-0.719395	-0.650478	-0.636513
50%	0.001463	-0.022189	0.061264	0.046104
75%	0.675910	0.629861	0.710325	0.642668
max	3.162745	4.108418	3.597951	4.410464

col=data[2]
col[np.abs(col)>3]

565    3.597951
Name: 2, dtype: float64

data[(np.abs(data) > 3).any(axis=1)]
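# .any(axis=1) checks whether each row contains at least one True value, that is, at least one element whose absolute value exceeds 3. It returns a boolean Series with one entry per row, which is then used to select the matching rows.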

	0	1	2	3
16	-3.010992	-0.122886	1.194125	0.702766
111	-1.152743	4.108418	-2.097178	0.831827
219	-3.127882	1.781813	0.011281	0.587799
565	0.099141	-1.705600	3.597951	0.345174
596	3.162745	-1.597465	-0.552896	-2.756078
625	-0.042392	3.189888	0.723891	-0.670110
835	-1.125737	-0.699685	-1.730857	4.410464

data[np.abs(data) > 3] = np.sign(data) * 3  # cap values to the interval -3 to 3
data.describe()

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	-0.001711	-0.037868	0.048582	0.009099
std	1.006490	0.966920	1.011305	0.977041
min	-3.000000	-2.643439	-2.949846	-2.962251
25%	-0.678444	-0.719395	-0.650478	-0.636513
50%	0.001463	-0.022189	0.061264	0.046104
75%	0.675910	0.629861	0.710325	0.642668
max	3.000000	3.000000	3.000000	3.000000
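The same capping can be written with DataFrame.clip, which is not used in the notes but does the equivalent job in one call. The sketch below compares both approaches on a small made-up frame.

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({"x": [-4.2, 0.5, 3.7], "y": [1.0, -3.1, 2.9]})

# Capping via sign, as in the notes: values beyond +/-3 become +/-3.
capped_sign = values.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# DataFrame.clip expresses the same cap directly.
capped_clip = values.clip(lower=-3, upper=3)

print(capped_sign)
print(capped_clip)
```

clip returns a new object, so the original frame stays available for comparison.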

np.sign(data).head()

	0	1	2	3
0	1.0	1.0	1.0	-1.0
1	-1.0	1.0	1.0	-1.0
2	1.0	1.0	1.0	-1.0
3	1.0	1.0	1.0	-1.0
4	-1.0	-1.0	-1.0	-1.0

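Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do with the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering: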

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19

sampler = np.random.permutation(5)
sampler

array([4, 0, 3, 1, 2])

df.take(sampler)

	0	1	2	3
4	16	17	18	19
0	0	1	2	3
3	12	13	14	15
1	4	5	6	7
2	8	9	10	11

df.sample(n=3)

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
3	12	13	14	15

choices=pd.Series([5,7,-1,6,4])
draws=choices.sample(n=10,replace=True)
draws

0    5
2   -1
0    5
3    6
0    5
3    6
1    7
1    7
2   -1
0    5
dtype: int64
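The results above change on every run. If reproducibility matters, one option is to seed the random source, as in the sketch below. The seed values are arbitrary, and `numpy.random.default_rng` is NumPy's newer Generator interface rather than the function used in the notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded Generator gives the same permutation on every run.
rng = np.random.default_rng(12345)
order = rng.permutation(len(frame))
shuffled = frame.take(order)

# sample accepts random_state for the same purpose.
subset_a = frame.sample(n=3, random_state=0)
subset_b = frame.sample(n=3, random_state=0)

# frac=1 shuffles all rows using sample alone.
reshuffled = frame.sample(frac=1, random_state=0)

print(shuffled)
print(subset_a)
```

Two calls with the same random_state select the same rows, which makes an analysis repeatable.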

Computing Indicator/Dummy Variables

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	b	5

pd.get_dummies(df['key'])

	a	b	c
0	False	True	False
1	False	True	False
2	True	False	False
3	False	False	True
4	True	False	False
5	False	True	False

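In some cases, you may want to add a prefix to the columns of the indicator DataFrame so that it can be merged with other data. The prefix argument of get_dummies does this: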

dummies=pd.get_dummies(df['key'],prefix='key')
df_with_dummy=df[['data1']].join(dummies)
df_with_dummy

	data1	key_a	key_b	key_c
0	0	False	True	False
1	1	False	True	False
2	2	True	False	False
3	3	False	False	True
4	4	True	False	False
5	5	False	True	False
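The outputs above contain True/False because pandas 2.0 and later return boolean indicator columns by default. If 0/1 integers are preferred, the dtype argument of get_dummies can request them, as in this small sketch (the Series `keys` is made up for illustration).

```python
import pandas as pd

keys = pd.Series(["b", "b", "a", "c"], name="key")

# Default result: one indicator column per distinct value.
default_dummies = pd.get_dummies(keys, prefix="key")

# dtype=int asks for 0/1 columns instead of booleans.
int_dummies = pd.get_dummies(keys, prefix="key", dtype=int)

# Each row belongs to exactly one category, so every row sums to 1.
row_sums = int_dummies.sum(axis=1)

print(default_dummies)
print(int_dummies)
```

Integer columns are convenient when the indicators feed directly into numeric code.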

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Take the MovieLens 1M dataset as an example:

mnames = ['movie_id', 'title', 'genres']
# engine='python' is required for the multi-character separator and avoids a ParserWarning
movies = pd.read_table('F:/项目学习/利用Pyhon进行数据分析（第二版）/利用Pyhon进行数据分析/pydata-book-2nd-edition/datasets/movielens/movies.dat', sep='::', header=None, names=mnames,encoding='ISO-8859-1',engine='python')
movies[:10]

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children's
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller

# Collect every genre label; a plain loop is clearer than calling map for its side effect
all_genres=[]
for x in movies.genres:
    all_genres.extend(x.split('|'))
all_genres

['Animation',
 "Children's",
 'Comedy',
 'Adventure',
 "Children's",
 'Fantasy',
 'Comedy',
 'Romance',
 ...]

genres=pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

gen=movies.genres[0]
gen.split("|")

['Animation', "Children's", 'Comedy']

dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                              1.0
Genre_Children's                             1.0
Genre_Comedy                                 1.0
Genre_Adventure                              0.0
Genre_Fantasy                                0.0
Genre_Romance                                0.0
Genre_Drama                                  0.0
Genre_Action                                 0.0
Genre_Crime                                  0.0
Genre_Thriller                               0.0
Genre_Horror                                 0.0
Genre_Sci-Fi                                 0.0
Genre_Documentary                            0.0
Genre_War                                    0.0
Genre_Musical                                0.0
Genre_Mystery                                0.0
Genre_Film-Noir                              0.0
Genre_Western                                0.0
Name: 0, dtype: object

dummies_demo = movies['genres'].str.get_dummies('|')
prefix = 'genre_'
dummies_demo = dummies_demo.add_prefix(prefix)

# Combine the two datasets
merged_df = pd.concat([movies, dummies_demo], axis=1)
# merged_df = merged_df.drop(columns=['genres'])

merged_df

	movie_id	title	genres	genre_Action	genre_Adventure	genre_Animation	genre_Children's	genre_Comedy	genre_Crime	genre_Documentary	...	genre_Fantasy	genre_Film-Noir	genre_Horror	genre_Musical	genre_Mystery	genre_Romance	genre_Sci-Fi	genre_Thriller	genre_War	genre_Western
0	1	Toy Story (1995)	Animation|Children's|Comedy	0	0	1	1	1	0	0	...	0	0	0	0	0	0	0	0	0	0
1	2	Jumanji (1995)	Adventure|Children's|Fantasy	0	1	0	1	0	0	0	...	1	0	0	0	0	0	0	0	0	0
2	3	Grumpier Old Men (1995)	Comedy|Romance	0	0	0	0	1	0	0	...	0	0	0	0	0	1	0	0	0	0
3	4	Waiting to Exhale (1995)	Comedy|Drama	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
4	5	Father of the Bride Part II (1995)	Comedy	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3878	3948	Meet the Parents (2000)	Comedy	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
3879	3949	Requiem for a Dream (2000)	Drama	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3880	3950	Tigerland (2000)	Drama	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3881	3951	Two Family House (2000)	Drama	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3882	3952	Contender, The (2000)	Drama|Thriller	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0

3883 rows × 21 columns
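Since the MovieLens file may not be at hand, the str.get_dummies approach can be tried on a few invented rows in the same layout. The titles and genre strings below are made up for illustration.

```python
import pandas as pd

# Three made-up rows in the same 'A|B|C' layout as the genres column.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "genres": ["Comedy|Romance", "Drama", "Comedy|Drama|War"],
})

# str.get_dummies splits on the separator and builds one indicator column per label.
indicators = films["genres"].str.get_dummies("|").add_prefix("genre_")

# Column sums give the number of films carrying each label.
label_counts = indicators.sum()

films_with_indicators = films.join(indicators)

print(films_with_indicators)
print(label_counts)
```

Compared with the explicit loop over rows, this keeps the whole transformation in one vectorized call, and the resulting columns come out in alphabetical order.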

A useful recipe for statistical applications is to combine get_dummies with a discretization function such as cut:

np.random.seed(12345)
values=np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

bins= [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values,bins))

	(0.0, 0.2]	(0.2, 0.4]	(0.4, 0.6]	(0.6, 0.8]	(0.8, 1.0]
0	False	False	False	False	True
1	False	True	False	False	False
2	True	False	False	False	False
3	False	True	False	False	False
4	False	False	True	False	False
5	False	False	True	False	False
6	False	False	False	False	True
7	False	False	False	True	False
8	False	False	False	True	False
9	False	False	False	True	False

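String Manipulation

Python has become a popular language for data processing in part because of its easy-to-use string and text processing features.

String Object Methods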

val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

pieces=[x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

'::'.join(pieces)

'a::b::guido'

Using Python's in keyword is the best way to detect a substring, though index and find can also be used:

'guido' in val

True

val.index(',')

1

val.find(':')

-1

Note the difference between find and index: index raises an exception if the string is not found (instead of returning -1):

val.index(':')

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[88], line 1
----> 1 val.index(':')


ValueError: substring not found

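count returns the number of occurrences of a particular substring:

val.count(',')

2

replace substitutes occurrences of one pattern for another. It is also commonly used to delete patterns, by passing an empty string: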

val.replace(',','::')

'a::b:: guido'

val.replace(',','')

'ab guido'

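Python built-in string methods

count: returns the number of non-overlapping occurrences of a substring in the string
endswith, startswith: return True if the string ends with the given suffix (or starts with the given prefix)
join: uses the string as a delimiter for concatenating a sequence of other strings
index: returns the position of the first character of the substring if it is found in the string; raises ValueError if it is not found
find: returns the position of the first character of the first occurrence of the substring; returns -1 if it is not found
rfind: returns the position of the first character of the last occurrence of the substring; returns -1 if it is not found
replace: replaces occurrences of one string with another string
strip, rstrip, lstrip: trim whitespace, including newlines (on both sides, the right side, or the left side)
split: breaks the string into a list of substrings using the given delimiter
lower, upper: convert alphabet characters to lowercase or uppercase
ljust, rjust: left-justify or right-justify; pad the opposite side of the string with spaces (or some other fill character) to return a string with a minimum width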
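Several of the methods listed above can be exercised together on one string. This is a small sketch with a made-up input line, using only built-in str methods.

```python
line = "  Guido,van,Rossum \n"

cleaned = line.strip()                # trim surrounding whitespace and the newline
parts = cleaned.split(",")            # break into pieces on the comma

starts = cleaned.startswith("Guido")  # prefix test
ends = cleaned.endswith("Rossum")     # suffix test

first_comma = cleaned.find(",")       # position of the first comma
last_comma = cleaned.rfind(",")       # position of the last comma
missing = cleaned.find(";")           # -1 because ';' never occurs

padded = parts[1].ljust(6, ".")       # pad on the right to a width of 6
rejoined = " ".join(p.upper() for p in parts)

print(parts, first_comma, last_comma, missing)
print(padded, rejoined)
```

find and rfind return -1 rather than raising, which makes them convenient in conditionals where index would need a try/except.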

Regular Expressions

The functions in the re module fall into three categories: pattern matching, substitution, and splitting.

import re
text = "foo bar\t baz \tqux"
re.split(r'\s+',text)  # raw string avoids an invalid-escape warning for \s

['foo', 'bar', 'baz', 'qux']

You can compile the regex yourself with re.compile, forming a reusable regex object:

regex=re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

regex.findall(text)  # findall returns a list of all matches in the string

[' ', '\t ', ' \t']

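match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match, and match only matches at the beginning of the string.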

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

text[m.start():m.end()]

'dave@google.com'

print(regex.match(text))

None

The sub method returns a new string with occurrences of the pattern replaced by the given string:

print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com

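Regular expression methods

findall, finditer: return all non-overlapping matches of the pattern in a string. findall returns a list of all matches, while finditer returns them one by one through an iterator
match: matches the pattern at the start of the string and optionally segments the pattern components into groups. Returns a match object if the pattern matches, otherwise None
search: scans the string for a match to the pattern and returns a match object if one is found. Unlike match, the match can be anywhere in the string, not only at the beginning
split: breaks the string into pieces at each occurrence of the pattern
sub, subn: replace all occurrences of the pattern in the string with the replacement expression (the count argument limits how many); subn additionally returns the number of substitutions made. Use \1, \2, and so on in the replacement string to refer to match group elements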
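The methods in this list that were not demonstrated earlier (finditer, subn, and the count argument of sub) can be sketched as follows. The two-line text and the named groups are choices made for this example; the pattern itself is the email pattern from the notes.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# Named groups make the pieces of each match self-describing.
pattern = r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer yields match objects lazily instead of building a list.
users = [m.group("user") for m in regex.finditer(text)]

# search finds the first match anywhere; match only tries position 0.
first = regex.search(text)
at_start = regex.match(text)

# subn returns the new string together with the number of substitutions.
redacted, n_subs = regex.subn("REDACTED", text)

# count limits how many occurrences sub replaces.
redacted_once = regex.sub("REDACTED", text, count=1)

print(users, first.group("domain"), at_start, n_subs)
print(redacted_once)
```

Here match returns None because the text starts with a name rather than an email address.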

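Vectorized String Functions in pandas

Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data: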

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob': 'rob@gmail.com', 'Wes': np.nan}
data=pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute:

matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
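The element retrieval mentioned above can be sketched on the findall result, whose elements are lists of tuples. The Series `emails` below is a shortened copy of the data in the notes; str.extract is an additional pandas method shown as one possible alternative.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({"Dave": "dave@google.com", "Rob": "rob@gmail.com", "Wes": np.nan})
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# findall gives one list of tuples per element; .str[0] takes the first tuple.
first_match = emails.str.findall(pattern, flags=re.IGNORECASE).str[0]

# .str.get(i) and .str[i] both index into each element; NaN stays NaN.
domains = first_match.str.get(1)

# Slicing works the same way on the raw strings.
prefixes = emails.str[:3]

# str.extract returns the capture groups as DataFrame columns.
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

print(domains)
print(prefixes)
print(parts)
```

Missing entries pass through every step as NaN, so no explicit null check is needed.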
