This is the 8th day of my participation in the First Challenge 2022. For details: First Challenge 2022.

Introduction to the

On April 15, 1912, the supposedly unsinkable Titanic sank due to a collision with an iceberg. Without adequate rescue equipment, 1,502 of the 2,224 passengers died. The accident happened, but can we detect any patterns in the historical data from the Titanic? In this article, you will learn how to use Pandas for data analysis.

Titanic passenger data

We have downloaded some Titanic passenger data from Kaggle, which includes the following fields:

The variable name	meaning	The values
survival	Whether to survive	0 = No, 1 = Yes
pclass	Class of ticket	1 = 1st, 2 = 2nd, 3 = 3rd
sex	gender
Age	age
sibsp	Spouse information
parch	Parent or child information
ticket	The ticket price coding
fare	The ship fee
cabin	The cabin number
embarked	Port of entry	C = Cherbourg, Q = Queenstown, S = Southampton

The downloaded file is a CSV file. Let’s see how pandas can be used for data analysis.

Analyze the data using PANDAS

Importing dependency packages

This article mainly uses PANDAS and matplotlib, so the following general Settings need to be set up first:

from numpy.random import randn
import numpy as np
np.random.seed(123)
import os
import matplotlib.pyplot as plt
import pandas as pd
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)
pd.options.display.max_rows = 20
Copy the code

Read and analyze data

Pandas provides a read_csv method to read and convert CSV data to a DataFrame:

path = '.. /data/titanic.csv' df = pd.read_csv(path) dfCopy the code

Let’s look at the data we read:

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S
5	897	3	Svensson, Mr. Johan Cervin	male	14.0	0	0	7538	9.2250	NaN	S
6	898	3	Connolly, Miss. Kate	female	30.0	0	0	330972	7.6292	NaN	Q
7	899	2	Caldwell, Mr. Albert Francis	male	26.0	1	1	248738	29.0000	NaN	S
8	900	3	Abrahim, Mrs. Joseph (Sophie Halaut Easu)	female	18.0	0	0	2657	7.2292	NaN	C
9	901	3	Davies, Mr. John Samuel	male	21.0	2	0	A/4 48871	24.1500	NaN	S
.	.	.	.	.	.	.	.	.	.	.	.
408	1300	3	Riordan, Miss. Johanna Hannah””	female	NaN	0	0	334915	7.7208	NaN	Q
409	1301	3	Peacock, Miss. Treasteall	female	3.0	1	1	SOTON/O.Q. 3101315	13.7750	NaN	S
410	1302	3	Naughton, Miss. Hannah	female	NaN	0	0	365237	7.7500	NaN	Q
411	1303	1	Minahan, Mrs. William Edward (Lillian E Thorpe)	female	37.0	1	0	19928	90.0000	C78	Q
412	1304	3	Henriksson, Miss. Jenny Lovisa	female	28.0	0	0	347086	7.7750	NaN	S
413	1305	3	Spector, Mr. Woolf	male	NaN	0	0	A.5. 3236	8.0500	NaN	S
414	1306	1	Oliva y Ocana, Dona. Fermina	female	39.0	0	0	PC 17758	108.9000	C105	C
415	1307	3	Saether, Mr. Simon Sivertsen	male	38.5	0	0	SOTON/O.Q. 3101262	7.2500	NaN	S
416	1308	3	Ware, Mr. Frederick	male	NaN	0	0	359309	8.0500	NaN	S
417	1309	3	Peter, Master. Michael J	male	NaN	1	1	2668	22.3583	NaN	C

418 rows × 11 columns

Basic statistics can be viewed by calling the Describe method of DF:

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	21.000000	0.000000	0.000000	7.895800
50%	1100.500000	3.000000	27.000000	0.000000	0.000000	14.454200
75%	1204.750000	3.000000	39.000000	1.000000	0.000000	31.500000
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200

If you want to see the port where the passenger is logged in, you can choose:

df['Embarked'][:10]
Copy the code

0    Q
1    S
2    Q
3    S
4    S
5    S
6    Q
7    S
8    C
9    S
Name: Embarked, dtype: object
Copy the code

Value_counts counts them:

embark_counts=df['Embarked'].value_counts()
embark_counts[:10]
Copy the code

S    270
C    102
Q     46
Name: Embarked, dtype: int64
Copy the code

As can be seen from the results, 270 passengers have logged in from PORT S, 102 from port C and 46 from port Q.

Similarly, we can count the age information:

age_counts=df['Age'].value_counts()
age_counts.head(10)
Copy the code

The top 10 ages are as follows:

24.0    17
21.0    17
22.0    16
30.0    15
18.0    13
27.0    12
26.0    12
25.0    11
23.0    11
29.0    10
Name: Age, dtype: int64
Copy the code

Calculate the average age:

df['Age'].mean()
Copy the code

30.272590361445783
Copy the code

In fact, some data are age-free, and we can fill them in with averages:

clean_age1 = df['Age'].fillna(df['Age'].mean())
clean_age1.value_counts()
Copy the code

You can see that the average is 30.27 and the number is 86.

30.27259 86 24.00000 17 21.00000 17 22.00000 16 30.00000 15 18.00000 13 26.00000 12 27.00000 12 25.00000 11 23.00000 11 . 36.50000 1 40.50000 1 11.50000 1 34.00000 1 15.00000 1 7.00000 1 60.50000 1 26.50000 1 76.00000 1 34.50000 1 Name: Age, Length: 80, dtype: int64Copy the code

It may not be a good idea to use averages as ages. Another option is to discard averages:

clean_age2=df['Age'].dropna()
clean_age2
age_counts = clean_age2.value_counts()
ageset=age_counts.head(10)
ageset
Copy the code

24.0    17
21.0    17
22.0    16
30.0    15
18.0    13
27.0    12
26.0    12
25.0    11
23.0    11
29.0    10
Name: Age, dtype: int64
Copy the code

Graphical representation and matrix transformation

Graphics are very helpful for data analysis. We use bar charts to represent the ages of the top 10 obtained above:

import seaborn as sns
sns.barplot(x=ageset.index, y=ageset.values)
Copy the code

Now let’s do a complex matrix transformation. Let’s filter out the data where age and sex are null:

cframe=df[df.Age.notnull() & df.Sex.notnull()]
cframe
Copy the code

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S
5	897	3	Svensson, Mr. Johan Cervin	male	14.0	0	0	7538	9.2250	NaN	S
6	898	3	Connolly, Miss. Kate	female	30.0	0	0	330972	7.6292	NaN	Q
7	899	2	Caldwell, Mr. Albert Francis	male	26.0	1	1	248738	29.0000	NaN	S
8	900	3	Abrahim, Mrs. Joseph (Sophie Halaut Easu)	female	18.0	0	0	2657	7.2292	NaN	C
9	901	3	Davies, Mr. John Samuel	male	21.0	2	0	A/4 48871	24.1500	NaN	S
.	.	.	.	.	.	.	.	.	.	.	.
403	1295	1	Carrau, Mr. Jose Pedro	male	17.0	0	0	113059	47.1000	NaN	S
404	1296	1	Frauenthal, Mr. Isaac Gerald	male	43.0	1	0	17765	27.7208	D40	C
405	1297	2	Nourney, Mr. Alfred (Baron von Drachstedt”)”	male	20.0	0	0	SC/PARIS 2166	13.8625	D38	C
406	1298	2	Ware, Mr. William Jeffery	male	23.0	1	0	28666	10.5000	NaN	S
407	1299	1	Widener, Mr. George Dunton	male	50.0	1	1	113503	211.5000	C80	C
409	1301	3	Peacock, Miss. Treasteall	female	3.0	1	1	SOTON/O.Q. 3101315	13.7750	NaN	S
411	1303	1	Minahan, Mrs. William Edward (Lillian E Thorpe)	female	37.0	1	0	19928	90.0000	C78	Q
412	1304	3	Henriksson, Miss. Jenny Lovisa	female	28.0	0	0	347086	7.7750	NaN	S
414	1306	1	Oliva y Ocana, Dona. Fermina	female	39.0	0	0	PC 17758	108.9000	C105	C
415	1307	3	Saether, Mr. Simon Sivertsen	male	38.5	0	0	SOTON/O.Q. 3101262	7.2500	NaN	S

332 rows × 11 columns

Next use groupby to group age and sex:

by_sex_age = cframe.groupby(['Age', 'Sex'])
by_sex_age.size()
Copy the code

Age Sex 0.17 female 1 0.33 male 1 0.75 male 1 0.83 male 1 0.92 female 1 1.00 female 3 2.00 female 1 male 1 3.00 female 1 5.00 male 1.. 60.00 Female 3 60.50 male 1 61.00 male 2 62.00 male 1 63.00 Female 1 male 1 64.00 Female 2 male 1 67.00 male 1 76.00 female 1 Length: 115, dtype: int64Copy the code

Use unstack to convert Sex column data to rows:

Sex	female	male
Age
0.17	1.0	0.0
0.33	0.0	1.0
0.75	0.0	1.0
0.83	0.0	1.0
0.92	1.0	0.0
1.00	3.0	0.0
2.00	1.0	1.0
3.00	1.0	0.0
5.00	0.0	1.0
6.00	0.0	3.0
.	.	.
58.00	1.0	0.0
59.00	1.0	0.0
60.00	3.0	0.0
60.50	0.0	1.0
61.00	0.0	2.0
62.00	0.0	1.0
63.00	1.0	1.0
64.00	2.0	1.0
67.00	0.0	1.0
76.00	1.0	0.0

79 rows × 2 columns

We add the number of people with the same age and use argsort to get the sorted index:

indexer = agg_counts.sum(1).argsort()
indexer.tail(10)
Copy the code

Age
58.0    37
59.0    31
60.0    29
60.5    32
61.0    34
62.0    22
63.0    38
64.0    27
67.0    26
76.0    30
dtype: int64
Copy the code

Extract the last 10, the largest 10, from agG_COUNTS:

count_subset = agg_counts.take(indexer.tail(10))
count_subset=count_subset.tail(10)
count_subset
Copy the code

Sex	female	male
Age
29.0	5.0	5.0
25.0	1.0	10.0
23.0	5.0	6.0
26.0	4.0	8.0
27.0	4.0	8.0
18.0	7.0	6.0
30.0	6.0	9.0
22.0	10.0	6.0
21.0	3.0	14.0
24.0	5.0	12.0

The above operation can be simplified to the following code:

agg_counts.sum(1).nlargest(10)
Copy the code

Age
21.0    17.0
24.0    17.0
22.0    16.0
30.0    15.0
18.0    13.0
26.0    12.0
27.0    12.0
23.0    11.0
25.0    11.0
29.0    10.0
dtype: float64
Copy the code

Count_subset for stack operation, convenient for the following drawing:

stack_subset = count_subset.stack()
stack_subset
Copy the code

Age   Sex   
29.0  female     5.0
      male       5.0
25.0  female     1.0
      male      10.0
23.0  female     5.0
      male       6.0
26.0  female     4.0
      male       8.0
27.0  female     4.0
      male       8.0
18.0  female     7.0
      male       6.0
30.0  female     6.0
      male       9.0
22.0  female    10.0
      male       6.0
21.0  female     3.0
      male      14.0
24.0  female     5.0
      male      12.0
dtype: float64
Copy the code

stack_subset.name = 'total'
stack_subset = stack_subset.reset_index()
stack_subset
Copy the code

Age	Sex	total
0	29.0	female	5.0
1	29.0	male	5.0
2	25.0	female	1.0
3	25.0	male	10.0
4	23.0	female	5.0
5	23.0	male	6.0
6	26.0	female	4.0
7	26.0	male	8.0
8	27.0	female	4.0
9	27.0	male	8.0
10	18.0	female	7.0
11	18.0	male	6.0
12	30.0	female	6.0
13	30.0	male	9.0
14	22.0	female	10.0
15	22.0	male	6.0
16	21.0	female	3.0
17	21.0	male	14.0
18	24.0	female	5.0
19	24.0	male	12.0

The diagram is as follows:

sns.barplot(x='total', y='Age', hue='Sex',  data=stack_subset)
Copy the code

Examples for this article can be found at: github.com/ddean2009/l…

This article is available at www.flydean.com/01-pandas-t…

The most popular interpretation, the most profound dry goods, the most concise tutorial, many you don’t

Welcome to pay attention to my public number: “procedures those things”, understand technology, more understand you!

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Data Analysis Practical case: Pandas’ use in Titanic passenger data

Introduction to the

Titanic passenger data

Analyze the data using PANDAS

Importing dependency packages

Read and analyze data

Graphical representation and matrix transformation

Data Analysis Practical case: Pandas’ use in Titanic passenger data

Introduction to the

Titanic passenger data

Analyze the data using PANDAS

Importing dependency packages

Read and analyze data

Graphical representation and matrix transformation

Related Posts

240. Search two-dimensional matrix II

C# starter series 4 — constants and variables

Why switch Case when elseIf exists