'''
Assignment 03
Exercise 1
• Read 北向.csv with trade_date as the row index
• Inspect the basic info: check for missing values and handle them
• Drop the rows that contain missing values
• Inspect the basic info again to confirm the data is clean
• The index column is useless, so drop it
• Check whether there are duplicate rows
• Drop the duplicate rows
• Sort the row index in ascending order
• Save the cleaned data to 北向(副).csv
'''
import pandas as pd
import numpy as np

# Read the CSV, using the trade_date column as the row index
northward_data = pd.read_csv("北向.csv", index_col='trade_date')
print(northward_data.head())
'''
index ggt_ss ggt_sz hgt sgt north_money south_money
trade_date
20190624 0 -541.17 792.38 -757.96 -1153.14 -1911.10 251.21
20190621 1 -97.40 701.36 3722.36 3608.14 7330.50 603.96
20190620 2 660.05 555.23 1914.44 3650.47 5564.91 1215.28
20190619 3 -491.58 186.47 2092.51 2831.23 4923.74 -305.11
20190618 4 1667.40 832.29 974.92 617.24 1592.16 2499.69
'''
print(northward_data.info())
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884 entries, 20190624 to 20190605
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 884 non-null int64
1 ggt_ss 850 non-null float64
2 ggt_sz 850 non-null float64
3 hgt 870 non-null float64
4 sgt 870 non-null float64
5 north_money 884 non-null float64
6 south_money 884 non-null float64
dtypes: float64(6), int64(1)
memory usage: 55.2 KB
None
'''
print(northward_data.isnull().sum())
'''
trade_date 0
ggt_ss 34
ggt_sz 34
hgt 14
sgt 14
north_money 0
south_money 0
dtype: int64
'''
print(northward_data.isnull().any())
'''
trade_date False
ggt_ss True
ggt_sz True
hgt True
sgt True
north_money False
south_money False
dtype: bool
'''
print(northward_data.isnull().values)
'''
[[False False False ... False False False]
[False False False ... False False False]
[False False False ... False False False]
...
[False False False ... False False False]
[False False False ... False False False]
[False False False ... False False False]]
'''
'''
any(): returns True if at least one value in the sequence is True;
all(): returns True only if every value in the sequence is True.
'''
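The behaviour of any()/all() can be checked on a tiny DataFrame (a minimal sketch with made-up values):

```python
import numpy as np
import pandas as pd

# tiny made-up frame: one NaN in column 'a'
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})
mask = df.isnull()           # per-cell boolean DataFrame

print(mask.any())            # per column: a -> True, b -> False
print(mask.values.any())     # True: at least one NaN anywhere
print(mask.values.all())     # False: not every cell is NaN
```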
if northward_data.isnull().values.any():
    print('NaN values present')
else:
    print('no NaN values')
# rows that contain at least one NaN, printed for inspection
print(northward_data[northward_data.isnull().any(axis=1)])
'''
index ggt_ss ggt_sz hgt sgt north_money south_money
trade_date
20190510 29 1763.55 1294.88 NaN NaN 0.00 3058.43
20190510 29 1763.55 1294.88 NaN NaN 0.00 3058.43
20190430 34 NaN NaN -463.88 -83.01 -546.89 0.00
20190430 34 NaN NaN -463.88 -83.01 -546.89 0.00
20190429 35 NaN NaN 3819.12 713.61 4532.73 0.00
... ... ... ... ... ... ... ...
20190912 230 NaN NaN 2488.84 806.64 3295.48 0.00
20190911 231 NaN NaN 1804.94 627.08 2432.02 0.00
20190911 231 NaN NaN 1804.94 627.08 2432.02 0.00
20190628 283 1411.88 218.06 NaN NaN 0.00 1629.94
20190628 283 1411.88 218.06 NaN NaN 0.00 1629.94
[96 rows x 7 columns]
'''
# drop every row that contains a missing value
northward_data.dropna(axis=0, inplace=True)
print(northward_data.info())
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 836 entries, 20190624 to 20190605
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 836 non-null int64
1 ggt_ss 836 non-null float64
2 ggt_sz 836 non-null float64
3 hgt 836 non-null float64
4 sgt 836 non-null float64
5 north_money 836 non-null float64
6 south_money 836 non-null float64
dtypes: float64(6), int64(1)
memory usage: 52.2 KB
None
no NaN values after cleaning
'''
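dropna above removes whole rows; when only a few cells are missing, filling can keep those rows instead. A minimal fillna sketch with made-up values:

```python
import numpy as np
import pandas as pd

# made-up series with one gap
s = pd.Series([1.0, np.nan, 3.0])

filled_const = s.fillna(0.0)   # replace the gap with a constant
filled_prev = s.ffill()        # carry the previous valid value forward
print(filled_const.tolist())   # [1.0, 0.0, 3.0]
print(filled_prev.tolist())    # [1.0, 1.0, 3.0]
```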
if northward_data.isnull().values.any():
    print('NaN values still present after cleaning')
else:
    print('no NaN values after cleaning')
# the index column is useless; drop it
northward_data.drop(['index'], axis=1, inplace=True)
if northward_data.duplicated().values.any():
    print('duplicate rows present')
else:
    print('no duplicate rows')
northward_data = northward_data.drop_duplicates()
print(northward_data)
if northward_data.duplicated().values.any():
    print('duplicate rows still present after cleaning')
else:
    print('no duplicate rows after cleaning')
# the task asks for ascending order, so sort the index ascending (the default)
northward_data = northward_data.sort_index(ascending=True)
print(northward_data)
'''
[822 rows x 6 columns]
no duplicate rows after cleaning
             ggt_ss   ggt_sz      hgt      sgt  north_money  south_money
trade_date
20170103     994.00   277.00  -582.47   615.49        33.02      1271.00
20170104     593.00   258.00 -1348.85   629.70      -719.15       851.00
20170105     686.00   291.00  -912.34   633.50      -278.84       977.00
20170106     385.00   207.00 -2103.00   644.00     -1459.00       592.00
20170109    1175.00   337.00  -846.78   886.18        39.40      1512.00
...             ...      ...      ...      ...          ...          ...
20200831    2231.93  2660.40 -3414.43 -4611.59     -8026.02      4892.33
20200901     617.59  2336.67 -1805.03  1505.10      -299.93      2954.26
20200902    1113.10  2915.24 -6007.57  -606.69     -6614.26      4028.34
20200903      80.46  1470.66   917.30 -1921.12     -1003.82      1551.12
20200904    1977.81  2827.44 -3106.19 -3211.42     -6317.61      4805.25
[822 rows x 6 columns]
'''
northward_data.to_csv('北向(副).csv')
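The whole cleaning pipeline of Exercise 1 can also be written as one method chain. A sketch on a tiny made-up stand-in for the 北向.csv data:

```python
import numpy as np
import pandas as pd

# tiny made-up frame in the shape of 北向.csv after read_csv
raw = pd.DataFrame(
    {'index': [0, 1, 1, 2],
     'north_money': [5.0, np.nan, np.nan, 7.0]},
    index=pd.Index([20190624, 20190621, 20190621, 20190620], name='trade_date'),
)

clean = (raw.dropna()                 # drop rows with missing values
            .drop(columns='index')   # remove the useless index column
            .drop_duplicates()       # drop duplicate rows
            .sort_index())           # ascending by trade_date
print(clean)
```

Chaining avoids inplace mutation, so each step returns a fresh frame and the raw data stays untouched.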
import pandas as pd
import numpy as np
'''
Exercise 2
Read FoodFacts.csv, a global food-facts dataset, and work out the average
number of additives used per country.
Step plan:
• 1. Read the data
• 2. Assess data quality
• 3. Clean the data
• 4. Aggregate usage per country
  • 4.1 Clean and aggregate the country data
  • 4.2 Compute additive usage per country
• 5. Save the result
'''
'''
A naive full read of FoodFacts.csv emits:
DtypeWarning: Columns (0,3,5,27,36) have mixed types. Specify dtype option on import or set low_memory=False.
'''
# exploratory full read; this naive read emits the DtypeWarning quoted above
full_data = pd.read_csv('FoodFacts.csv')
print(full_data.head())
'''
                 code  ...  nutrition_score_uk_100g
0  000000000000012866  ...                      NaN
1       0000000024600  ...                      NaN
2       0000000036252  ...                      NaN
3       0000000039259  ...                      NaN
4       0000000039529  ...                      NaN
'''
print(full_data.columns.values)
'''
['code' 'url' 'creator' 'created_t' 'created_datetime' 'last_modified_t'
'last_modified_datetime' 'product_name' 'generic_name' 'quantity'
'packaging' 'packaging_tags' 'brands' 'brands_tags' 'categories'
'categories_tags' 'categories_en' 'origins' 'origins_tags'
'manufacturing_places' 'manufacturing_places_tags' 'labels' 'labels_tags'
......
'cocoa_100g' 'chlorophyl_100g' 'carbon_footprint_100g'
'nutrition_score_fr_100g' 'nutrition_score_uk_100g']
'''
def del_NaN(pandas_data):
    '''
    :param pandas_data: source data
    :return: the data with NaN rows dropped (unchanged if there were none)
    '''
    if pandas_data.isnull().values.any():
        print('pandas_data contains NaN values, cleaning now')
        return pandas_data.dropna()
    print('pandas_data contains no NaN values')
    return pandas_data
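As a reminder of why the helper returns a value: dropna (without inplace) produces a new DataFrame and leaves its input untouched. A quick check on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [2.0, 3.0]})
cleaned = df.dropna()          # returns a new frame; df itself is unchanged

print(len(df), len(cleaned))   # 2 1
```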
def clean_countries_en_data(pandas_data):
    '''
    :param pandas_data: source data
    :return: the cleaned data
    '''
    # keep only rows whose countries_en names a single country (no commas);
    # .copy() avoids the SettingWithCopyWarning on the assignment below
    without_commas_data = pandas_data[~pandas_data['countries_en'].str.contains(r",")].copy()
    print('comma-free rows:', without_commas_data)
    '''
    comma-free rows:       countries_en  additives_n
    5      United Kingdom          0.0
    6              France          0.0
    8              France          0.0
    10     United Kingdom          5.0
    11     United Kingdom          5.0
    ...               ...          ...
    65480   United States          4.0
    65490          France          0.0
    65494          France          0.0
    65499          France          0.0
    65501          France          0.0
    [42319 rows x 2 columns]
    '''
    # normalise country names to lower case
    without_commas_data["countries_en"] = without_commas_data["countries_en"].str.lower()
    print('lower-cased without_commas_data=', without_commas_data)
    '''
    lower-cased without_commas_data=       countries_en  additives_n
    5      united kingdom          0.0
    6              france          0.0
    8              france          0.0
    10     united kingdom          5.0
    11     united kingdom          5.0
    ...               ...          ...
    65480   united states          4.0
    65490          france          0.0
    65494          france          0.0
    65499          france          0.0
    65501          france          0.0
    [42319 rows x 2 columns]
    '''
    return without_commas_data
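The two cleaning steps in this function, dropping multi-country rows and lower-casing, can be checked on a toy frame (hypothetical rows):

```python
import pandas as pd

# hypothetical rows mimicking countries_en / additives_n
df = pd.DataFrame({'countries_en': ['France', 'France,Spain', 'United Kingdom'],
                   'additives_n': [1.0, 2.0, 3.0]})

single = df[~df['countries_en'].str.contains(',')].copy()  # drop multi-country rows
single['countries_en'] = single['countries_en'].str.lower()
print(single['countries_en'].tolist())   # ['france', 'united kingdom']
```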
if __name__ == '__main__':
    # silence the SettingWithCopyWarning from chained assignment in the cleaning step
    pd.set_option('mode.chained_assignment', None)
    food_brief_data = pd.read_csv('FoodFacts.csv', usecols=["countries_en", "additives_n"])
    print('food_brief_data111', food_brief_data)
    '''
countries_en additives_n
0 France NaN
1 France NaN
2 France NaN
3 France NaN
4 France NaN
... ... ...
65498 Poland NaN
65499 France 0.0
65500 France NaN
65501 France 0.0
65502 China NaN
[65503 rows x 2 columns]
'''
    food_brief_data = del_NaN(food_brief_data)
    print('food_brief_data222', food_brief_data)
    '''
food_brief_data222 countries_en additives_n
5 United Kingdom 0.0
6 France 0.0
8 France 0.0
10 United Kingdom 5.0
11 United Kingdom 5.0
... ... ...
65480 United States 4.0
65490 France 0.0
65494 France 0.0
65499 France 0.0
65501 France 0.0
[43616 rows x 2 columns]
'''
    food_brief_data = clean_countries_en_data(food_brief_data)
    print('food_brief_data333', food_brief_data)
    '''
    food_brief_data333       countries_en  additives_n
5 united kingdom 0.0
6 france 0.0
10 united kingdom 5.0
13 france 2.0
15 spain 0.0
... ... ...
65304 new zealand 6.0
65326 new zealand 8.0
65341 new zealand 7.0
65403 burkina faso 2.0
65405 burkina faso 1.0
[411 rows x 2 columns]
'''
    # average additive count per country
    last_DataFrame = food_brief_data.groupby(['countries_en']).mean()
    print(type(last_DataFrame))
    last_DataFrame.rename(columns={'additives_n': 'additives_mean'}, inplace=True)
    print('last_DataFrame=', last_DataFrame)
    last_DataFrame = last_DataFrame.sort_values(by='additives_mean', ascending=False)
    '''
<class 'pandas.core.frame.DataFrame'>
last_DataFrame= additives_mean
countries_en
albania 0.000000
algeria 3.500000
andorra 0.000000
argentina 2.222222
australia 0.489871
... ...
turkey 0.300000
united arab emirates 1.000000
united kingdom 1.243810
united states 2.162905
venezuela 0.000000
[84 rows x 1 columns]
'''
    last_DataFrame.to_csv('按国家统计添加剂使用情况.csv', encoding='utf_8_sig')
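The groupby-then-rename step above can equivalently be done in one call with named aggregation; a sketch on made-up rows in the shape of the cleaned data:

```python
import pandas as pd

# made-up rows in the shape of the cleaned data
df = pd.DataFrame({'countries_en': ['france', 'france', 'spain'],
                   'additives_n': [2.0, 4.0, 1.0]})

# mean per country, with the output column named in the same call
stats = df.groupby('countries_en').agg(additives_mean=('additives_n', 'mean'))
print(stats)
```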