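Review: The main goal of this course is to use real data and hands-on practice to learn the data analysis workflow and become familiar with basic data analysis operations in Python. With that goal in mind, we now formally begin the hands-on part: completing the Titanic task on Kaggle and working through the full data analysis workflow.
Two resources are useful here:
the textbook "Python for Data Analysis", and baidu.com & google.com (make good use of search engines)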
1 Chapter 1: Loading the Data and Taking a First Look
1.1 Loading the data
Dataset download: https://www.kaggle.com/c/titanic/overview
1.1.1 Task 1: import numpy and pandas
import numpy as np
import pandas as pd
[Hint] If the import fails, learn how to install the numpy and pandas libraries in your Python environment.
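One way to act on this hint is to check whether the two libraries are visible to the current interpreter before importing them. The sketch below is only an illustration: `is_installed` is a hypothetical helper name, and the suggested `pip install` command assumes pip is the package manager in use (with conda, `conda install numpy pandas` plays the same role).

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two libraries this chapter relies on.
for name in ("numpy", "pandas"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, try: pip install " + name)
```

If a library is reported missing inside a notebook, installing it into the same environment the notebook kernel uses is the important part.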
1.1.2 Task 2: load the data
(1) Load the data using a relative path
(2) Load the data using an absolute path
import os
os.getcwd()
'/Users/liubaoyun/Desktop/Datawhale 数据分析/Datawhale数据分析/第一单元项目集合'
abs_path = os.path.abspath('train.csv')
abs_path
'/Users/liubaoyun/Desktop/Datawhale 数据分析/Datawhale数据分析/第一单元项目集合/train.csv'
abs_train = pd.read_csv(abs_path)
rel_train = pd.read_csv('train.csv')
train = abs_train
[Hint] If loading with a relative path raises an error, use os.getcwd() to check the current working directory.
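The hint above can be sketched as follows. A relative path is always resolved against the current working directory, so checking where it actually points usually explains a "file not found" error. The temporary directory and the file name `demo.csv` are stand-ins invented for this example so that it runs anywhere; they are not part of the course material.

```python
import os
import tempfile
from pathlib import Path

# Create a throwaway directory containing a small stand-in file.
work_dir = Path(tempfile.mkdtemp())
(work_dir / "demo.csv").write_text("a,b\n1,2\n", encoding="utf-8")

original_cwd = os.getcwd()
os.chdir(work_dir)  # pretend this is the notebook's working directory
try:
    relative = Path("demo.csv")
    resolved = relative.resolve()  # the absolute path the relative name points to
    found = relative.exists()      # whether a file really exists at that location
finally:
    os.chdir(original_cwd)         # always restore the original working directory

print(resolved, found)
```

If `found` is False for your own file, either move the file into the directory printed by os.getcwd() or pass the full path to pd.read_csv().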
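[Think] Now that you know how to load data, compare pd.read_csv() with pd.read_table(). What would you need to do to make them behave the same? Look into the difference between '.tsv' and '.csv' files: how would you load each of them?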
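As a possible answer to the question above: the two functions differ mainly in their default separator (a comma for pd.read_csv, a tab for pd.read_table), so passing `sep` explicitly makes them interchangeable. The sketch below uses a tiny in-memory table instead of train.csv; the sample values are made up for illustration.

```python
import io

import pandas as pd

# The same small table, once comma-separated (.csv) and once tab-separated (.tsv).
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))                 # default sep=','
from_table = pd.read_table(io.StringIO(csv_text), sep=",")    # read_table told to use commas
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")       # read_csv told to use tabs
from_tsv_table = pd.read_table(io.StringIO(tsv_text))         # default sep='\t'

# All four calls produce the same DataFrame.
print(from_csv.equals(from_table), from_csv.equals(from_tsv), from_csv.equals(from_tsv_table))
```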
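[Summary] Loading the data is the first step of all work. We will run into different data formats (e.g. .csv, .tsv, .xlsx), but the method and the way of thinking are the same. In future work and projects, when you hit a problem you have not seen before, look things up (for example with Google), understand the business logic, and be clear about what the input and the output are.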
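1.1.3 Task 3: read the data chunk by chunk, 1000 rows per chunk (the code below uses chunksize=100, so the 891-row dataset yields nine chunks)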
pd.options.display.max_rows=10
train
|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
chunker = pd.read_csv('train.csv',chunksize=100)
chunker
<pandas.io.parsers.TextFileReader at 0x7fa713369250>
for piece in chunker:
    print('chunk_train')
    print('\n')
    print(piece)
chunk_train
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
95 96 0 3
96 97 0 1
97 98 1 1
98 99 1 2
99 100 0 2
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
95 Shorney, Mr. Charles Joseph male NaN 0
96 Goldschmidt, Mr. George B male 71.0 0
97 Greenfield, Mr. William Bertram male 23.0 0
98 Doling, Mrs. John T (Ada Julia Bone) female 34.0 0
99 Kantor, Mr. Sinai male 34.0 1
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
95 0 374910 8.0500 NaN S
96 0 PC 17754 34.6542 A5 C
97 1 PC 17759 63.3583 D10 D12 C
98 1 231919 23.0000 NaN S
99 0 244367 26.0000 NaN S
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass Name \
100 101 0 3 Petranec, Miss. Matilda
101 102 0 3 Petroff, Mr. Pastcho ("Pentcho")
102 103 0 1 White, Mr. Richard Frasar
103 104 0 3 Johansson, Mr. Gustaf Joel
104 105 0 3 Gustafsson, Mr. Anders Vilhelm
.. ... ... ... ...
195 196 1 1 Lurette, Miss. Elise
196 197 0 3 Mernagh, Mr. Robert
197 198 0 3 Olsen, Mr. Karl Siegwart Andreas
198 199 1 3 Madigan, Miss. Margaret "Maggie"
199 200 0 2 Yrois, Miss. Henriette ("Mrs Harbeck")
Sex Age SibSp Parch Ticket Fare Cabin Embarked
100 female 28.0 0 0 349245 7.8958 NaN S
101 male NaN 0 0 349215 7.8958 NaN S
102 male 21.0 0 1 35281 77.2875 D26 S
103 male 33.0 0 0 7540 8.6542 NaN S
104 male 37.0 2 0 3101276 7.9250 NaN S
.. ... ... ... ... ... ... ... ...
195 female 58.0 0 0 PC 17569 146.5208 B80 C
196 male NaN 0 0 368703 7.7500 NaN Q
197 male 42.0 0 1 4579 8.4042 NaN S
198 female NaN 0 0 370370 7.7500 NaN Q
199 female 24.0 0 0 248747 13.0000 NaN S
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass \
200 201 0 3
201 202 0 3
202 203 0 3
203 204 0 3
204 205 1 3
.. ... ... ...
295 296 0 1
296 297 0 3
297 298 0 1
298 299 1 1
299 300 1 1
Name Sex Age SibSp \
200 Vande Walle, Mr. Nestor Cyriel male 28.0 0
201 Sage, Mr. Frederick male NaN 8
202 Johanson, Mr. Jakob Alfred male 34.0 0
203 Youseff, Mr. Gerious male 45.5 0
204 Cohen, Mr. Gurshon "Gus" male 18.0 0
.. ... ... ... ...
295 Lewy, Mr. Ervin G male NaN 0
296 Hanna, Mr. Mansour male 23.5 0
297 Allison, Miss. Helen Loraine female 2.0 1
298 Saalfeld, Mr. Adolphe male NaN 0
299 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0
Parch Ticket Fare Cabin Embarked
200 0 345770 9.5000 NaN S
201 2 CA. 2343 69.5500 NaN S
202 0 3101264 6.4958 NaN S
203 0 2628 7.2250 NaN C
204 0 A/5 3540 8.0500 NaN S
.. ... ... ... ... ...
295 0 PC 17612 27.7208 NaN C
296 0 2693 7.2292 NaN C
297 2 113781 151.5500 C22 C26 S
298 0 19988 30.5000 C106 S
299 1 PC 17558 247.5208 B58 B60 C
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass Name \
300 301 1 3 Kelly, Miss. Anna Katherine "Annie Kate"
301 302 1 3 McCoy, Mr. Bernard
302 303 0 3 Johnson, Mr. William Cahoone Jr
303 304 1 2 Keane, Miss. Nora A
304 305 0 3 Williams, Mr. Howard Hugh "Harry"
.. ... ... ... ...
395 396 0 3 Johansson, Mr. Erik
396 397 0 3 Olsson, Miss. Elina
397 398 0 2 McKane, Mr. Peter David
398 399 0 2 Pain, Dr. Alfred
399 400 1 2 Trout, Mrs. William H (Jessie L)
Sex Age SibSp Parch Ticket Fare Cabin Embarked
300 female NaN 0 0 9234 7.7500 NaN Q
301 male NaN 2 0 367226 23.2500 NaN Q
302 male 19.0 0 0 LINE 0.0000 NaN S
303 female NaN 0 0 226593 12.3500 E101 Q
304 male NaN 0 0 A/5 2466 8.0500 NaN S
.. ... ... ... ... ... ... ... ...
395 male 22.0 0 0 350052 7.7958 NaN S
396 female 31.0 0 0 350407 7.8542 NaN S
397 male 46.0 0 0 28403 26.0000 NaN S
398 male 23.0 0 0 244278 10.5000 NaN S
399 female 28.0 0 0 240929 12.6500 NaN S
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass \
400 401 1 3
401 402 0 3
402 403 0 3
403 404 0 3
404 405 0 3
.. ... ... ...
495 496 0 3
496 497 1 1
497 498 0 3
498 499 0 1
499 500 0 3
Name Sex Age SibSp \
400 Niskanen, Mr. Juha male 39.0 0
401 Adams, Mr. John male 26.0 0
402 Jussila, Miss. Mari Aina female 21.0 1
403 Hakkarainen, Mr. Pekka Pietari male 28.0 1
404 Oreskovic, Miss. Marija female 20.0 0
.. ... ... ... ...
495 Yousseff, Mr. Gerious male NaN 0
496 Eustis, Miss. Elizabeth Mussey female 54.0 1
497 Shellard, Mr. Frederick William male NaN 0
498 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1
499 Svensson, Mr. Olof male 24.0 0
Parch Ticket Fare Cabin Embarked
400 0 STON/O 2. 3101289 7.9250 NaN S
401 0 341826 8.0500 NaN S
402 0 4137 9.8250 NaN S
403 0 STON/O2. 3101279 15.8500 NaN S
404 0 315096 8.6625 NaN S
.. ... ... ... ... ...
495 0 2627 14.4583 NaN C
496 0 36947 78.2667 D20 C
497 0 C.A. 6212 15.1000 NaN S
498 2 113781 151.5500 C22 C26 S
499 0 350035 7.7958 NaN S
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass \
500 501 0 3
501 502 0 3
502 503 0 3
503 504 0 3
504 505 1 1
.. ... ... ...
595 596 0 3
596 597 1 2
597 598 0 3
598 599 0 3
599 600 1 1
Name Sex Age SibSp Parch \
500 Calic, Mr. Petar male 17.0 0 0
501 Canavan, Miss. Mary female 21.0 0 0
502 O'Sullivan, Miss. Bridget Mary female NaN 0 0
503 Laitinen, Miss. Kristina Sofia female 37.0 0 0
504 Maioni, Miss. Roberta female 16.0 0 0
.. ... ... ... ... ...
595 Van Impe, Mr. Jean Baptiste male 36.0 1 1
596 Leitch, Miss. Jessie Wills female NaN 0 0
597 Johnson, Mr. Alfred male 49.0 0 0
598 Boulos, Mr. Hanna male NaN 0 0
599 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49.0 1 0
Ticket Fare Cabin Embarked
500 315086 8.6625 NaN S
501 364846 7.7500 NaN Q
502 330909 7.6292 NaN Q
503 4135 9.5875 NaN S
504 110152 86.5000 B79 S
.. ... ... ... ...
595 345773 24.1500 NaN S
596 248727 33.0000 NaN S
597 LINE 0.0000 NaN S
598 2664 7.2250 NaN C
599 PC 17485 56.9292 A20 C
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass \
600 601 1 2
601 602 0 3
602 603 0 1
603 604 0 3
604 605 1 1
.. ... ... ...
695 696 0 2
696 697 0 3
697 698 1 3
698 699 0 1
699 700 0 3
Name Sex Age SibSp \
600 Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr... female 24.0 2
601 Slabenoff, Mr. Petco male NaN 0
602 Harrington, Mr. Charles H male NaN 0
603 Torber, Mr. Ernst William male 44.0 0
604 Homer, Mr. Harry ("Mr E Haven") male 35.0 0
.. ... ... ... ...
695 Chapman, Mr. Charles Henry male 52.0 0
696 Kelly, Mr. James male 44.0 0
697 Mullens, Miss. Katherine "Katie" female NaN 0
698 Thayer, Mr. John Borland male 49.0 1
699 Humblen, Mr. Adolf Mathias Nicolai Olsen male 42.0 0
Parch Ticket Fare Cabin Embarked
600 1 243847 27.0000 NaN S
601 0 349214 7.8958 NaN S
602 0 113796 42.4000 NaN S
603 0 364511 8.0500 NaN S
604 0 111426 26.5500 NaN C
.. ... ... ... ... ...
695 0 248731 13.5000 NaN S
696 0 363592 8.0500 NaN S
697 0 35852 7.7333 NaN Q
698 1 17421 110.8833 C68 C
699 0 348121 7.6500 F G63 S
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass \
700 701 1 1
701 702 1 1
702 703 0 3
703 704 0 3
704 705 0 3
.. ... ... ...
795 796 0 2
796 797 1 1
797 798 1 3
798 799 0 3
799 800 0 3
Name Sex Age SibSp \
700 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.0 1
701 Silverthorne, Mr. Spencer Victor male 35.0 0
702 Barbara, Miss. Saiide female 18.0 0
703 Gallagher, Mr. Martin male 25.0 0
704 Hansen, Mr. Henrik Juul male 26.0 1
.. ... ... ... ...
795 Otter, Mr. Richard male 39.0 0
796 Leader, Dr. Alice (Farnham) female 49.0 0
797 Osman, Mrs. Mara female 31.0 0
798 Ibrahim Shawah, Mr. Yousseff male 30.0 0
799 Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go... female 30.0 1
Parch Ticket Fare Cabin Embarked
700 0 PC 17757 227.5250 C62 C64 C
701 0 PC 17475 26.2875 E24 S
702 1 2691 14.4542 NaN C
703 0 36864 7.7417 NaN Q
704 0 350025 7.8542 NaN S
.. ... ... ... ... ...
795 0 28213 13.0000 NaN S
796 0 17465 25.9292 D17 S
797 0 349244 8.6833 NaN S
798 0 2685 7.2292 NaN C
799 1 345773 24.1500 NaN S
[100 rows x 12 columns]
chunk_train
PassengerId Survived Pclass \
800 801 0 2
801 802 1 2
802 803 1 1
803 804 1 3
804 805 1 3
.. ... ... ...
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Name Sex Age SibSp Parch \
800 Ponesell, Mr. Martin male 34.00 0 0
801 Collyer, Mrs. Harvey (Charlotte Annie Tate) female 31.00 1 1
802 Carter, Master. William Thornton II male 11.00 1 2
803 Thomas, Master. Assad Alexander male 0.42 0 1
804 Hedman, Mr. Oskar Arvid male 27.00 0 0
.. ... ... ... ... ...
886 Montvila, Rev. Juozas male 27.00 0 0
887 Graham, Miss. Margaret Edith female 19.00 0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2
889 Behr, Mr. Karl Howell male 26.00 0 0
890 Dooley, Mr. Patrick male 32.00 0 0
Ticket Fare Cabin Embarked
800 250647 13.0000 NaN S
801 C.A. 31921 26.2500 NaN S
802 113760 120.0000 B96 B98 S
803 2625 8.5167 NaN C
804 347089 6.9750 NaN S
.. ... ... ... ...
886 211536 13.0000 NaN S
887 112053 30.0000 B42 S
888 W./C. 6607 23.4500 NaN S
889 111369 30.0000 C148 C
890 370376 7.7500 NaN Q
[91 rows x 12 columns]
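[Think] What is chunked reading? Why would you read data chunk by chunk?
[Hint] What type is chunker (the data chunks)? What does each piece look like when you print it in a for loop?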
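To explore these questions on a small scale, the sketch below reads a ten-row in-memory table in chunks of four rows. The data and the chunk size are invented for illustration; the idea is the same as with train.csv. Passing chunksize makes pd.read_csv return a reader object rather than a DataFrame, and each item produced by iterating over it is an ordinary DataFrame, which is what lets a file too large for memory be processed piece by piece.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv: 10 rows, two columns.
rows = ["PassengerId,Survived"] + [f"{i},{i % 2}" for i in range(1, 11)]
csv_text = "\n".join(rows) + "\n"

chunker = pd.read_csv(io.StringIO(csv_text), chunksize=4)
print(type(chunker).__name__)  # a reader object, not a DataFrame

chunk_sizes = []
survived_total = 0
for piece in chunker:  # each piece is a regular DataFrame with up to 4 rows
    chunk_sizes.append(len(piece))
    # Aggregate per chunk so that only one chunk is held in memory at a time.
    survived_total += int(piece["Survived"].sum())

print(chunk_sizes, survived_total)  # [4, 4, 2] 5
```

Note that the reader is exhausted after one pass; to iterate again, call pd.read_csv with chunksize once more.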
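1.1.4 Task 4: change the column headers to Chinese and set the index to the passenger ID [for English-language data, translating the headers can make the data more intuitive to work with]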
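PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级(1/2/3等舱位) (passenger class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (literally "number of cousins"; Kaggle's data dictionary defines SibSp as the number of siblings/spouses aboard)
Parch => 父母与小孩个数 (number of parents and children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)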
train.head()
|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.columns=['乘客ID','是否幸存','乘客等级(1/2/3等舱位)',
'乘客姓名','性别','年龄','堂兄弟/妹个数',
'父母与小孩个数','船票信息','票价','客舱','登船港口'
]
train
|     | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
[Think] One way to change the headers to Chinese is to replace the English column names with Chinese ones. Are there other ways?
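Two alternatives that might answer this question are sketched below on a three-column stand-in table (the sample values are made up): renaming through a mapping with DataFrame.rename, and supplying the new names at load time with the names and header parameters of pd.read_csv. The last step also shows one way to make the passenger ID the index, which the task title asks for; using set_index here is this sketch's choice, not something taken from the original notebook.

```python
import io

import pandas as pd

csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Option A: rename through a mapping; columns not listed would keep their names.
mapping = {
    "PassengerId": "乘客ID",
    "Survived": "是否幸存",
    "Pclass": "乘客等级(1/2/3等舱位)",
}
renamed = df.rename(columns=mapping)

# Option B: give the new names while loading; header=0 tells pandas that the
# first line of the file is the old header and should be replaced.
at_load = pd.read_csv(io.StringIO(csv_text), names=list(mapping.values()), header=0)

# Use the passenger ID column as the row index.
indexed = renamed.set_index("乘客ID")

print(renamed.columns.tolist())
print(indexed.index.tolist())
```

Compared with assigning a full list to train.columns, the mapping approach does not depend on column order, so a reordered file cannot silently mislabel a column.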
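1.2 First look
After loading the data, you will probably want an overview of its overall structure and a few sample rows: for example, how large the data is, how many columns it has, what type each column is, and whether it contains nulls.
1.2.1 Task 1: view basic information about the data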
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 乘客ID 891 non-null int64
1 是否幸存 891 non-null int64
2 乘客等级(1/2/3等舱位) 891 non-null int64
3 乘客姓名 891 non-null object
4 性别 891 non-null object
5 年龄 714 non-null float64
6 堂兄弟/妹个数 891 non-null int64
7 父母与小孩个数 891 non-null int64
8 船票信息 891 non-null object
9 票价 891 non-null float64
10 客舱 204 non-null object
11 登船港口 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
[Hint] Several functions can do this; try summarizing them.
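As a starting point for that summary, the sketch below shows three other common ways to inspect a DataFrame besides info(): shape, dtypes and describe(). It uses a three-row stand-in table with invented values rather than the real training set.

```python
import io

import pandas as pd

csv_text = "PassengerId,Age,Sex\n1,22.0,male\n2,,female\n3,26.0,female\n"
df = pd.read_csv(io.StringIO(csv_text))

shape = df.shape         # (number of rows, number of columns)
dtypes = df.dtypes       # the data type of each column
summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns

print(shape)
print(dtypes)
print(summary)
```

Note that the count row of describe() skips missing values, so comparing it with the number of rows is a quick way to spot columns that contain nulls.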
1.2.2 Task 2: look at the first 10 rows and the last 15 rows of the table
train[:10]
|     | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
train[-15:]
|     | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
| 877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
| 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
15 rows × 12 columns
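The slices train[:10] and train[-15:] can also be written with the head and tail methods, which some readers find easier to read. The sketch below checks on a 30-row stand-in frame that the two spellings agree. It also shows, as an assumption about why the 15-row output above is shortened with "...", that the display.max_rows option set to 10 earlier in the notebook truncates the printed result; raising it temporarily with pd.option_context prints every row.

```python
import pandas as pd

# Stand-in for the training set: 30 rows with a single column.
df = pd.DataFrame({"PassengerId": range(1, 31)})

first_ten = df.head(10)       # same rows as df[:10]
last_fifteen = df.tail(15)    # same rows as df[-15:]

# Temporarily allow up to 20 rows so the 15-row result is shown in full.
with pd.option_context("display.max_rows", 20):
    shown = repr(last_fifteen)

print(shown)
```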
1.2.3 Task 3: check whether the data is null, returning True where a value is null and False everywhere else
train.isnull()
|     | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | False | False | False | False | False | False | False | False | False | False | True | False |
| 887 | False | False | False | False | False | False | False | False | False | False | False | False |
| 888 | False | False | False | False | False | True | False | False | False | False | True | False |
| 889 | False | False | False | False | False | False | False | False | False | False | False | False |
| 890 | False | False | False | False | False | False | False | False | False | False | True | False |
891 rows × 12 columns
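[Summary] All of the operations above are ways of observing the data itself.
[Think] From what other angles could you observe a dataset? Look for answers; they will be a great help in the data analysis that follows.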
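One further angle, building directly on isnull(), is to collapse the True/False table into a count of missing values per column, which is much easier to read than 891 rows of booleans. A minimal sketch on a three-row stand-in table with invented values:

```python
import io

import pandas as pd

csv_text = "Age,Cabin,Embarked\n22.0,,S\n,C85,C\n26.0,,\n"
df = pd.read_csv(io.StringIO(csv_text))

missing_per_column = df.isnull().sum()   # number of missing values in each column
missing_share = df.isnull().mean()       # fraction of missing values in each column
any_missing = df.isnull().any()          # True for columns with at least one missing value

print(missing_per_column)
print(missing_share)
```

This works because True counts as 1 and False as 0 when summed or averaged.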
1.3 Saving the data
1.3.1 Task 1: save the data you loaded and modified as a new file, train_chinese.csv, in the working directory
train.to_csv('train_chinese.csv',encoding = 'utf8')
c= pd.read_csv('train_chinese.csv')
c.head()
|     | Unnamed: 0 | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
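The extra "Unnamed: 0" column in the reloaded table comes from to_csv writing the row index as a first, unnamed column by default. A minimal sketch of how it arises and how index=False avoids it, using an in-memory buffer and a two-row stand-in frame instead of a file on disk:

```python
import io

import pandas as pd

df = pd.DataFrame({"乘客ID": [1, 2], "是否幸存": [0, 1]})

# Default behaviour: the row index is written as an extra unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_clean.columns.tolist())
```

If a file has already been saved with the index, reading it with pd.read_csv(..., index_col=0) is another way to avoid the extra column.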
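[Summary] That covers loading the data and getting started. Next we will work with operations on the data itself, focusing on how numpy and pandas are used in work and project scenarios.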