Introduction to the
Today we’ll cover the advanced tutorials for Pandas, including reading and writing files, selecting subsets, and graphical representations.
Read and write files
A key step in data processing is to read the file for analysis, and then write the analysis results to the file again.
Pandas supports reading and writing in a variety of file formats:
In [108]: pd.read_
read_clipboard() read_excel() read_fwf() read_hdf() read_json read_parquet read_sas read_sql_query read_stata
read_csv read_feather() read_gbq() read_html read_msgpack read_pickle read_sql read_sql_table read_table
Copy the code
The Titanic. CSV provided by Pandas’ website will be used as an example to explain how Pandas can be used.
Titanic. CSV provides the information of more than 800 passengers on Titanite. It is a matrix of 891 rows x 12 columns.
We use Pandas to read the CSV:
In [5]: titanic=pd.read_csv("titanic.csv")
Copy the code
The read_CSV method converts the CSV file to a DataFrame for pandas.
By default, we use the DF variable directly, and the first 5 lines and the last 5 lines are displayed by default:
In [3]: titanic Out[3]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 211 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female... 0ston /O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) Female... 0 113803 53.1000 C123 S 4 5 03 Allen, Mr. William Henry male... 0 373450 8.0500 NaN S.. . . . . . . . . . . . 886 887 0 2 Montvila, Rev. Juozas male ... 0 211536 13.0000 NaN S 887 888 11 Graham, Miss. Margaret... 0 112053 30.0000 B42 S 888 889 0 3 Miss. Catherine Helen "Carrie" female... 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male... 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick Male... 0 370376 7.775 NaN Q [891 rows x 12 columns]Copy the code
You can use head(n) and tail(n) to specify a specific number of rows:
In [4]: titanic.head(8) Out[4]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 211 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female... 0ston /O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) Female... 0 113803 53.1000 C123 S 4 5 03 Allen, Mr. William Henry male... 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male... 0 330877 8.4583 NaN Q 6 7 0 1 McCarthy, Mr. Timothy J male... E46 S 7 8 0 3 Palsson, Master. Gosta Leonard male... 1 349909 21.075 NaN S [8 rows x 12 columns]Copy the code
Use dtypes to view the data types for each column:
In [5]: titanic.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Copy the code
Use to_excel to convert DF to excel files, and use read_excel to read excel files again:
In [11]: titanic.to_excel('titanic.xlsx', sheet_name='passengers', index=False)
In [12]: titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')
Copy the code
Use info() to do a preliminary count of DF:
In [14]: titanic.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: Float64 (2), INT64 (5), Object (5) Memory Usage: 83.6+ KBCopy the code
The choice of DF
Select column data
DF’s head or tail methods can only display all column data. The following methods can select specific column data.
In [15]: ages = Titanic ["Age"] In [16]: ages. Head () Out[16]: 0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 Name: Age, dtype: float64Copy the code
Each column is a Series:
In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series
In [7]: titanic["Age"].shape
Out[7]: (891,)
Copy the code
You can also choose from several options:
In [8]: age_sex = titanic[["Age", "Sex"]]
In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
Copy the code
If multiple columns are selected, the result returned is of type DF:
In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame
In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
Copy the code
Select row data
We talked about how to select column data, now let’s look at how to select row data:
Select customers over 35 years old:
In [12]: above_35 = titanic[titanic["Age"] > 35] In [13]: above_35.head() Out[13]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 6 7 0 1 McCarthy, Mr. Timothy J male... 0 17463 51.8625 E46 S 11 12 11 Bonnell, Miss. Elizabeth female... 0 113783 26.5500c103 S 13 14 03 Andersson, Mr. Anders Johan male... 5 347082 31.2750 NaN S 15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) Female... 0 248706 16.000 NaN S [5 rows x 12 columns]Copy the code
Use ISIN to select Pclass in 2 and 3 for all customers:
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])] In [17]: class_23.head() Out[17]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris Male 22.0 1 0 A/5 21171 7.2500 NaN S 2 3 1 3 Heikkinen Miss. Laina female 26.0 0 STON/O2. 3101282 7.9250 NaN S 4 50 3 Allen, Mr. William Henry Male 35.0 00 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James Male NaN 0 0 330877 8.4583 NaN Q 7 8 0 3 Palsson, Master. Gosta Leonard Male 2.0 3 1 349909 21.0750 NaN SCopy the code
Isin above is equal to:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
Copy the code
Filter Age is not empty:
In [20]: age_no_na = titanic[titanic["Age"].notna()] In [21]: age_no_na.head() Out[21]: PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500 NaN S 1 211 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female... 0ston /O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) Female... 0 113803 53.1000 C123 S 4 5 03 Allen, Mr. William Henry male... 0 373450 8.0500 NaN S [5 rows x 12 columns]Copy the code
Select both rows and columns
We can select both rows and columns.
Rows and columns can be selected using LOC and ILOC. The difference between loC and ILOC is that LOC selects by name and ILOC selects by number.
Select age>35 from passenger name:
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
Copy the code
The first value in loC represents row selection and the second value represents column selection.
Use ILOC to select:
In [25]: titanic.iloc[9:25, 2:5] Out[25]: Pclass Name Sex 9 2 Nasser, Mrs. Nicholas (Adele Achem) female 10 3 Sandstrom, Miss. Marguerite Rut female 11 1 Bonnell, Miss. Elizabeth female 12 3 Saundercock, Mr. William Henry male 13 3 Andersson, Mr. Anders Johan male .. . . . 20 2 Fynney, Mr. Joseph J male 21 2 Beesley, Mr. Lawrence male 22 3 McGowan, Miss. Anna "Annie" female 23 1 Sloper, Mr. William Thompson male 24 3 Palsson, Miss. Torborg Danira female [16 rows x 3 columns]Copy the code
Plot plots
How to convert DF into a diversified graphical display?
To use matplotlib on the command line, start ipython’s QT environment:
ipython qtconsole --pylab=inline
Copy the code
Use plot directly to show the passenger information we read above:
import matplotlib.pyplot as plt
import pandas as pd
titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')
titanic.plot()
Copy the code
The horizontal coordinate is the index in DF, and the column coordinate is the name of each column. Note that the columns above show only numeric types.
We only show age information:
titanic['Age'].plot()
Copy the code
The default is bar chart, we can convert the form of graph, such as dot chart:
Titanic. The plot. Scatter (x = "PassengerId", y = "Age", alpha = 0.5)Copy the code
Select PassengerId as X-axis and age as Y-axis in the data:
In addition to scatter plots, many other images are supported:
[method_name for method_name in dir(titanic.plot) if not method_name.startswith("_")]
Out[11]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
Copy the code
Here’s another box:
titanic['Age'].plot.box()
Copy the code
It can be seen that most of the passengers are between 20 and 40 years old.
You can also graph the selected columns separately:
titanic.plot.area(figsize=(12, 4), subplots=True)
Copy the code
Specify a specific column:
titanic[['Age','Pclass']].plot.area(figsize=(12, 4), subplots=True)
Copy the code
You can also draw and fill:
fig, axs = plt.subplots(figsize=(12, 4));
Copy the code
Draw an empty graph and fill it:
titanic['Age'].plot.area(ax=axs);
axs.set_ylabel("Age");
fig
Copy the code
Create a new column using an existing column
Sometimes we need to transform an existing column to get a new one. For example, if we want to add an Age2 column with the value Age +10, we can do this:
titanic["Age2"]=titanic["Age"]+10; Titanic [["Age","Age2"]]. Head () Out[34]: Age2 0 22.0 32.0 1 38.0 48.0 2 26.0 36.0 3 35.0 45.0 4 35.0 45.0Copy the code
Columns can also be renamed:
titanic_renamed = titanic.rename( ... : columns={"Age": "Age2", ... : "Pclass": "Pclas2"})Copy the code
Convert column names to lowercase:
titanic_renamed = titanic_renamed.rename(columns=str.lower)
Copy the code
The statistical
Let’s look at the average age of passengers:
Titanic [r]. "Age" scheme () Out [35] : 29.69911764705882Copy the code
Select the median:
Titanic [["Age", "Fare"]]. Median () Out[36]: Age 28.0000 Fare 14.4542 dType: float64Copy the code
More information:
titanic[["Age", "Fare"]].describe()
Out[37]:
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200
Copy the code
Use AGG to specify specific aggregation methods:
titanic.agg({'Age': ['min', 'max', 'median', 'skew'],'Fare': ['min', 'max', 'median', 'mean']})
Out[38]:
Age Fare
max 80.000000 512.329200
mean NaN 32.204208
median 28.000000 14.454200
min 0.420000 0.000000
skew 0.389108 NaN
Copy the code
You can use groupby:
Titanic [["Sex", "Age"]].groupby("Sex").mean() Out[39]: Age Sex female 27.915709 male 30.726645Copy the code
Groupby all columns:
titanic.groupby("Sex").mean()
Out[40]:
PassengerId Survived Pclass Age SibSp Parch
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702
Copy the code
You can also select specific columns after groupby:
Titanic. Groupby ("Sex")["Age"]. Mean () Out[41]: Sex female 27.915709 male 30.726645 Name: Age, dType: float64Copy the code
Can be categorized into count:
titanic["Pclass"].value_counts()
Out[42]:
3 491
1 216
2 184
Name: Pclass, dtype: int64
Copy the code
The above is equivalent to:
titanic.groupby("Pclass")["Pclass"].count()
Copy the code
DF restructuring
You can sort by a column:
titanic.sort_values(by="Age").head()
Out[43]:
PassengerId Survived Pclass Name Sex \
803 804 1 3 Thomas, Master. Assad Alexander male
755 756 1 2 Hamalainen, Master. Viljo male
644 645 1 3 Baclini, Miss. Eugenie female
469 470 1 3 Baclini, Miss. Helene Barbara female
78 79 1 2 Caldwell, Master. Alden Gates male
Copy the code
Sort by multiple columns:
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[44]:
PassengerId Survived Pclass Name Sex Age \
851 852 0 3 Svensson, Mr. Johan male 74.0
116 117 0 3 Connors, Mr. Patrick male 70.5
280 281 0 3 Duane, Mr. Frank male 65.0
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0
Copy the code
Select the specific row and column data, in the following example we will select the part of the data that is female:
female=titanic[titanic['Sex']=='female'] female_subset=female[["Age","Pclass","PassengerId","Survived"]].sort_values(["Pclass"]).groupby(["Pclass"]).head(2) female_subset Out[58]: Age Pclass PassengerId passenger 1 38.0 1 2 1 356 22.0 1 357 1 726 30.02 727 1 443 28.02 444 1 855 18.03 856 1 654 18.0 3, 655Copy the code
Pivot can be used to convert the axis:
female_subset.pivot(columns="Pclass", values="Age")
Out[62]:
Pclass 1 2 3
1 38.0 NaN NaN
356 22.0 NaN NaN
443 NaN 28.0 NaN
654 NaN NaN 18.0
726 NaN 30.0 NaN
855 NaN NaN 18.0
female_subset.pivot(columns="Pclass", values="Age").plot()
Copy the code
This article is available at www.flydean.com/02-python-p…
The most popular interpretation, the most profound dry goods, the most concise tutorial, many tips you didn’t know waiting for you to discover!
Welcome to pay attention to my public number: “procedures those things”, understand technology, more understand you!