This is the 8th day of my participation in the First Challenge 2022. For details: First Challenge 2022.
Introduction to the
On April 15, 1912, the supposedly unsinkable Titanic sank due to a collision with an iceberg. Without adequate rescue equipment, 1,502 of the 2,224 passengers died. The accident happened, but can we detect any patterns in the historical data from the Titanic? In this article, you will learn how to use Pandas for data analysis.
Titanic passenger data
We have downloaded some Titanic passenger data from Kaggle, which includes the following fields:
The variable name | meaning | The values |
---|---|---|
survival | Whether to survive | 0 = No, 1 = Yes |
pclass | Class of ticket | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | gender | |
Age | age | |
sibsp | Spouse information | |
parch | Parent or child information | |
ticket | The ticket price coding | |
fare | The ship fee | |
cabin | The cabin number | |
embarked | Port of entry | C = Cherbourg, Q = Queenstown, S = Southampton |
The downloaded file is a CSV file. Let’s see how pandas can be used for data analysis.
Analyze the data using PANDAS
Importing dependency packages
This article mainly uses PANDAS and matplotlib, so the following general Settings need to be set up first:
from numpy.random import randn
import numpy as np
np.random.seed(123)
import os
import matplotlib.pyplot as plt
import pandas as pd
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)
pd.options.display.max_rows = 20
Copy the code
Read and analyze data
Pandas provides a read_csv method to read and convert CSV data to a DataFrame:
path = '.. /data/titanic.csv' df = pd.read_csv(path) dfCopy the code
Let’s look at the data we read:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S |
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S |
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S |
. | . | . | . | . | . | . | . | . | . | . | . |
408 | 1300 | 3 | Riordan, Miss. Johanna Hannah”” | female | NaN | 0 | 0 | 334915 | 7.7208 | NaN | Q |
409 | 1301 | 3 | Peacock, Miss. Treasteall | female | 3.0 | 1 | 1 | SOTON/O.Q. 3101315 | 13.7750 | NaN | S |
410 | 1302 | 3 | Naughton, Miss. Hannah | female | NaN | 0 | 0 | 365237 | 7.7500 | NaN | Q |
411 | 1303 | 1 | Minahan, Mrs. William Edward (Lillian E Thorpe) | female | 37.0 | 1 | 0 | 19928 | 90.0000 | C78 | Q |
412 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28.0 | 0 | 0 | 347086 | 7.7750 | NaN | S |
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
418 rows × 11 columns
Basic statistics can be viewed by calling the Describe method of DF:
PassengerId | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|
count | 418.000000 | 418.000000 | 332.000000 | 418.000000 | 418.000000 | 417.000000 |
mean | 1100.500000 | 2.265550 | 30.272590 | 0.447368 | 0.392344 | 35.627188 |
std | 120.810458 | 0.841838 | 14.181209 | 0.896760 | 0.981429 | 55.907576 |
min | 892.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 996.250000 | 1.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 1100.500000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1204.750000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.500000 |
max | 1309.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 512.329200 |
If you want to see the port where the passenger is logged in, you can choose:
df['Embarked'][:10]
Copy the code
0 Q
1 S
2 Q
3 S
4 S
5 S
6 Q
7 S
8 C
9 S
Name: Embarked, dtype: object
Copy the code
Value_counts counts them:
embark_counts=df['Embarked'].value_counts()
embark_counts[:10]
Copy the code
S 270
C 102
Q 46
Name: Embarked, dtype: int64
Copy the code
As can be seen from the results, 270 passengers have logged in from PORT S, 102 from port C and 46 from port Q.
Similarly, we can count the age information:
age_counts=df['Age'].value_counts()
age_counts.head(10)
Copy the code
The top 10 ages are as follows:
24.0 17
21.0 17
22.0 16
30.0 15
18.0 13
27.0 12
26.0 12
25.0 11
23.0 11
29.0 10
Name: Age, dtype: int64
Copy the code
Calculate the average age:
df['Age'].mean()
Copy the code
30.272590361445783
Copy the code
In fact, some data are age-free, and we can fill them in with averages:
clean_age1 = df['Age'].fillna(df['Age'].mean())
clean_age1.value_counts()
Copy the code
You can see that the average is 30.27 and the number is 86.
30.27259 86 24.00000 17 21.00000 17 22.00000 16 30.00000 15 18.00000 13 26.00000 12 27.00000 12 25.00000 11 23.00000 11 . 36.50000 1 40.50000 1 11.50000 1 34.00000 1 15.00000 1 7.00000 1 60.50000 1 26.50000 1 76.00000 1 34.50000 1 Name: Age, Length: 80, dtype: int64Copy the code
It may not be a good idea to use averages as ages. Another option is to discard averages:
clean_age2=df['Age'].dropna()
clean_age2
age_counts = clean_age2.value_counts()
ageset=age_counts.head(10)
ageset
Copy the code
24.0 17
21.0 17
22.0 16
30.0 15
18.0 13
27.0 12
26.0 12
25.0 11
23.0 11
29.0 10
Name: Age, dtype: int64
Copy the code
Graphical representation and matrix transformation
Graphics are very helpful for data analysis. We use bar charts to represent the ages of the top 10 obtained above:
import seaborn as sns
sns.barplot(x=ageset.index, y=ageset.values)
Copy the code
Now let’s do a complex matrix transformation. Let’s filter out the data where age and sex are null:
cframe=df[df.Age.notnull() & df.Sex.notnull()]
cframe
Copy the code
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S |
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S |
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S |
. | . | . | . | . | . | . | . | . | . | . | . |
403 | 1295 | 1 | Carrau, Mr. Jose Pedro | male | 17.0 | 0 | 0 | 113059 | 47.1000 | NaN | S |
404 | 1296 | 1 | Frauenthal, Mr. Isaac Gerald | male | 43.0 | 1 | 0 | 17765 | 27.7208 | D40 | C |
405 | 1297 | 2 | Nourney, Mr. Alfred (Baron von Drachstedt”)” | male | 20.0 | 0 | 0 | SC/PARIS 2166 | 13.8625 | D38 | C |
406 | 1298 | 2 | Ware, Mr. William Jeffery | male | 23.0 | 1 | 0 | 28666 | 10.5000 | NaN | S |
407 | 1299 | 1 | Widener, Mr. George Dunton | male | 50.0 | 1 | 1 | 113503 | 211.5000 | C80 | C |
409 | 1301 | 3 | Peacock, Miss. Treasteall | female | 3.0 | 1 | 1 | SOTON/O.Q. 3101315 | 13.7750 | NaN | S |
411 | 1303 | 1 | Minahan, Mrs. William Edward (Lillian E Thorpe) | female | 37.0 | 1 | 0 | 19928 | 90.0000 | C78 | Q |
412 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28.0 | 0 | 0 | 347086 | 7.7750 | NaN | S |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
332 rows × 11 columns
Next use groupby to group age and sex:
by_sex_age = cframe.groupby(['Age', 'Sex'])
by_sex_age.size()
Copy the code
Age Sex 0.17 female 1 0.33 male 1 0.75 male 1 0.83 male 1 0.92 female 1 1.00 female 3 2.00 female 1 male 1 3.00 female 1 5.00 male 1.. 60.00 Female 3 60.50 male 1 61.00 male 2 62.00 male 1 63.00 Female 1 male 1 64.00 Female 2 male 1 67.00 male 1 76.00 female 1 Length: 115, dtype: int64Copy the code
Use unstack to convert Sex column data to rows:
Sex | female | male |
---|---|---|
Age | ||
0.17 | 1.0 | 0.0 |
0.33 | 0.0 | 1.0 |
0.75 | 0.0 | 1.0 |
0.83 | 0.0 | 1.0 |
0.92 | 1.0 | 0.0 |
1.00 | 3.0 | 0.0 |
2.00 | 1.0 | 1.0 |
3.00 | 1.0 | 0.0 |
5.00 | 0.0 | 1.0 |
6.00 | 0.0 | 3.0 |
. | . | . |
58.00 | 1.0 | 0.0 |
59.00 | 1.0 | 0.0 |
60.00 | 3.0 | 0.0 |
60.50 | 0.0 | 1.0 |
61.00 | 0.0 | 2.0 |
62.00 | 0.0 | 1.0 |
63.00 | 1.0 | 1.0 |
64.00 | 2.0 | 1.0 |
67.00 | 0.0 | 1.0 |
76.00 | 1.0 | 0.0 |
79 rows × 2 columns
We add the number of people with the same age and use argsort to get the sorted index:
indexer = agg_counts.sum(1).argsort()
indexer.tail(10)
Copy the code
Age
58.0 37
59.0 31
60.0 29
60.5 32
61.0 34
62.0 22
63.0 38
64.0 27
67.0 26
76.0 30
dtype: int64
Copy the code
Extract the last 10, the largest 10, from agG_COUNTS:
count_subset = agg_counts.take(indexer.tail(10))
count_subset=count_subset.tail(10)
count_subset
Copy the code
Sex | female | male |
---|---|---|
Age | ||
29.0 | 5.0 | 5.0 |
25.0 | 1.0 | 10.0 |
23.0 | 5.0 | 6.0 |
26.0 | 4.0 | 8.0 |
27.0 | 4.0 | 8.0 |
18.0 | 7.0 | 6.0 |
30.0 | 6.0 | 9.0 |
22.0 | 10.0 | 6.0 |
21.0 | 3.0 | 14.0 |
24.0 | 5.0 | 12.0 |
The above operation can be simplified to the following code:
agg_counts.sum(1).nlargest(10)
Copy the code
Age
21.0 17.0
24.0 17.0
22.0 16.0
30.0 15.0
18.0 13.0
26.0 12.0
27.0 12.0
23.0 11.0
25.0 11.0
29.0 10.0
dtype: float64
Copy the code
Count_subset for stack operation, convenient for the following drawing:
stack_subset = count_subset.stack()
stack_subset
Copy the code
Age Sex
29.0 female 5.0
male 5.0
25.0 female 1.0
male 10.0
23.0 female 5.0
male 6.0
26.0 female 4.0
male 8.0
27.0 female 4.0
male 8.0
18.0 female 7.0
male 6.0
30.0 female 6.0
male 9.0
22.0 female 10.0
male 6.0
21.0 female 3.0
male 14.0
24.0 female 5.0
male 12.0
dtype: float64
Copy the code
stack_subset.name = 'total'
stack_subset = stack_subset.reset_index()
stack_subset
Copy the code
Age | Sex | total | |
---|---|---|---|
0 | 29.0 | female | 5.0 |
1 | 29.0 | male | 5.0 |
2 | 25.0 | female | 1.0 |
3 | 25.0 | male | 10.0 |
4 | 23.0 | female | 5.0 |
5 | 23.0 | male | 6.0 |
6 | 26.0 | female | 4.0 |
7 | 26.0 | male | 8.0 |
8 | 27.0 | female | 4.0 |
9 | 27.0 | male | 8.0 |
10 | 18.0 | female | 7.0 |
11 | 18.0 | male | 6.0 |
12 | 30.0 | female | 6.0 |
13 | 30.0 | male | 9.0 |
14 | 22.0 | female | 10.0 |
15 | 22.0 | male | 6.0 |
16 | 21.0 | female | 3.0 |
17 | 21.0 | male | 14.0 |
18 | 24.0 | female | 5.0 |
19 | 24.0 | male | 12.0 |
The diagram is as follows:
sns.barplot(x='total', y='Age', hue='Sex', data=stack_subset)
Copy the code
Examples for this article can be found at: github.com/ddean2009/l…
This article is available at www.flydean.com/01-pandas-t…
The most popular interpretation, the most profound dry goods, the most concise tutorial, many you don’t
Welcome to pay attention to my public number: “procedures those things”, understand technology, more understand you!