Pandas is a Python package that provides the fundamental data structures for data analysis and machine learning in Python. This article is an introduction to it.
Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data simple and intuitive. It is intended to be a fundamental high-level building block for practical, real-world data analysis in Python.
Introduction
Pandas is suitable for many different types of data, including:
- Tabular data with heterogeneously typed columns, such as SQL tables or Excel data
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data with row and column labels (homogeneously or heterogeneously typed)
- Any other form of observational/statistical data set
Since pandas is a Python package, you first need a Python environment on your machine; if you do not have one yet, plenty of guides are available on the web.
With Python in place, the next step is to install pandas.
In general, we can perform the installation via pip:
sudo pip3 install pandas
Or install pandas via conda:
conda install pandas
At the time of writing, the latest version of pandas is v0.22.0 (released December 29, 2017).
I’ve posted the source code and test data for this article on GitHub: pandas_Tutorial.
Pandas is often used in conjunction with NumPy, and the source code in this article uses it as well.
It is recommended that you familiarize yourself with NumPy before learning pandas; if you are not, see the companion tutorial on NumPy in this Python machine learning library series.
Core data structure
The Series and DataFrame data structures are at the core of pandas.
The two types of data structures are compared as follows:
Name | Dimensions | Description |
---|---|---|
Series | 1D | A one-dimensional array of homogeneously typed data with labels |
DataFrame | 2D | A table structure with labels and variable size, which may contain heterogeneously typed columns |
A DataFrame can be thought of as a container for Series, that is, a DataFrame can contain several Series.
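To make this relationship concrete, here is a minimal sketch (the DataFrame and column names are made up for illustration): selecting a single column of a DataFrame returns one of the Series it contains.

```python
import pandas as pd

# Selecting a single column of a DataFrame gives back a Series.
df = pd.DataFrame({"note": ["C", "D", "E"], "octave": [4, 4, 5]})
col = df["note"]
print(type(col))   # <class 'pandas.core.series.Series'>
print(col.name)    # note - the Series keeps the column name
```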
Series
Since Series is one-dimensional data, we can create this data directly from arrays, like this:
# data_structure.py
import pandas as pd
import numpy as np
series1 = pd.Series([1, 2, 3, 4])
print("series1:\n{}\n".format(series1))
This code is printed as follows:
series1:
0 1
1 2
2 3
3 4
dtype: int64
The output is explained as follows:
- The last line of the output shows the type of the data in the Series; here all of the data is of type int64.
- The data itself appears in the second column; the first column is the index of the data, which pandas calls the Index.
We can print the data and index in a Series separately:
# data_structure.py
print("series1.values: {}\n".format(series1.values))
print("series1.index: {}\n".format(series1.index))
These two lines of code are printed as follows:
series1.values: [1 2 3 4]
series1.index: RangeIndex(start=0, stop=4, step=1)
If not specified (as above), the index takes the form [0, n-1]. However, we can also specify the index when creating a Series. An index does not have to be an integer; it can be any type of data, such as a string. For example, here we map the seven musical notes to seven letters. The purpose of the index is to look up the corresponding data, as follows:
# data_structure.py
series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
index=["C", "D", "E", "F", "G", "A", "B"])
print("series2:\n{}\n".format(series2))
print("E is {}\n".format(series2["E"]))
This code is printed as follows:
series2:
C 1
D 2
E 3
F 4
G 5
A 6
B 7
dtype: int64
E is 3
DataFrame
Let’s look at creating a DataFrame. We can create a DataFrame from a 4×4 matrix built with NumPy, like this:
# data_structure.py
df1 = pd.DataFrame(np.arange(16).reshape(4, 4))
print("df1:\n{}\n".format(df1))
This code is printed as follows:
df1:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
As you can see from this output, the default index and column names are of the form [0, n-1].
We can specify the column name and index when creating the DataFrame, like this:
# data_structure.py
df2 = pd.DataFrame(np.arange(16).reshape(4, 4),
    columns=["column1", "column2", "column3", "column4"],
    index=["a", "b", "c", "d"])
print("df2:\n{}\n".format(df2))
This code is printed as follows:
df2:
column1 column2 column3 column4
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
We can also specify the column data directly to create the DataFrame:
# data_structure.py
df3 = pd.DataFrame({"note" : ["C", "D", "E", "F", "G", "A", "B"],
"weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})
print("df3:\n{}\n".format(df3))
This code is printed as follows:
df3:
note weekday
0 C Mon
1 D Tue
2 E Wed
3 F Thu
4 G Fri
5 A Sat
6 B Sun
Please note:
- Different columns of a DataFrame can have different data types
- If you create a DataFrame from an array of Series, each Series becomes a row, not a column
Such as:
# data_structure.py
noteSeries = pd.Series(["C", "D", "E", "F", "G", "A", "B"],
index=[1, 2, 3, 4, 5, 6, 7])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
index=[1, 2, 3, 4, 5, 6, 7])
df4 = pd.DataFrame([noteSeries, weekdaySeries])
print("df4:\n{}\n".format(df4))
The df4 output is as follows:
df4:
1 2 3 4 5 6 7
0 C D E F G A B
1 Mon Tue Wed Thu Fri Sat Sun
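If you want each Series to end up as a column rather than a row, one option is to transpose the result with the T attribute. A small sketch (using shortened, made-up data rather than the article’s exact variables):

```python
import pandas as pd

noteSeries = pd.Series(["C", "D", "E"], index=[1, 2, 3])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed"], index=[1, 2, 3])

# Each Series becomes a row...
df = pd.DataFrame([noteSeries, weekdaySeries])
# ...and transposing turns those rows back into columns.
dfT = df.T
print(dfT)
```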
We can add columns to, or remove columns from, a DataFrame as follows:
# data_structure.py
df3["No."] = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("df3:\n{}\n".format(df3))
del df3["weekday"]
print("df3:\n{}\n".format(df3))
This code is printed as follows:
df3:
note weekday No.
0 C Mon 1
1 D Tue 2
2 E Wed 3
3 F Thu 4
4 G Fri 5
5 A Sat 6
6 B Sun 7
df3:
note No.
0 C 1
1 D 2
2 E 3
3 F 4
4 G 5
5 A 6
6 B 7
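Besides the del statement, pandas also offers the DataFrame.drop method, which by default returns a new DataFrame and leaves the original untouched. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"note": ["C", "D"], "weekday": ["Mon", "Tue"]})
df2 = df.drop("weekday", axis=1)   # returns a new DataFrame

print(list(df.columns))    # ['note', 'weekday'] - the original is unchanged
print(list(df2.columns))   # ['note']
```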
Index objects and data access
The Index object contains metadata describing the axes. When a Series or DataFrame is created, the array or sequence of labels is converted into an Index. The column and row Index objects of a DataFrame can be obtained as follows:
# data_structure.py
print("df3.columns\n{}\n".format(df3.columns))
print("df3.index\n{}\n".format(df3.index))
These two lines of code are printed as follows:
df3.columns
Index(['note', 'No.'], dtype='object')
df3.index
RangeIndex(start=0, stop=7, step=1)
Please note:
- An Index is not a set, so it can contain duplicate data
- The values of an Index object cannot be changed, so data can be accessed safely
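Both points can be checked directly; a quick sketch (the example labels are made up):

```python
import pandas as pd

idx = pd.Index(["a", "a", "b"])   # duplicate labels are allowed
print(idx.is_unique)              # False

try:
    idx[0] = "c"                  # Index objects are immutable
except TypeError:
    print("Index does not support item assignment")
```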
DataFrame provides the following two operators to access its data:
- loc: accesses data by row and column index (label)
- iloc: accesses data by row and column subscript (integer position)
For example:
# data_structure.py
print("Note C, D is:\n{}\n".format(df3.loc[[0, 1], "note"]))
print("Note C, D is:\n{}\n".format(df3.iloc[[0, 1], 0]))
The first line of code accesses the elements with row indexes 0 and 1 and column index “note”. The second line accesses the elements with row subscripts 0 and 1 (for df3, the row indexes and row subscripts happen to be identical, so both are 0 and 1 here, but they have different meanings) and column subscript 0.
These two lines of code are printed as follows:
Note C, D is:
0 C
1 D
Name: note, dtype: object
Note C, D is:
0 C
1 D
Name: note, dtype: object
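The difference between loc and iloc becomes visible once the index no longer matches the positions. A small sketch (the data is made up for illustration):

```python
import pandas as pd

# The labels 3, 2, 1 deliberately disagree with the positions 0, 1, 2.
s = pd.Series([10, 20, 30], index=[3, 2, 1])
print(s.loc[1])    # 30 - looks up the *label* 1
print(s.iloc[1])   # 20 - looks up the *position* 1
```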
File operations
The pandas library provides a series of read_ functions for reading files in a variety of formats, as shown below:
- read_csv
- read_table
- read_fwf
- read_clipboard
- read_excel
- read_hdf
- read_html
- read_json
- read_msgpack
- read_pickle
- read_sas
- read_sql
- read_stata
- read_feather
Reading Excel files
Note: to read Excel files, you need to install an additional library: xlrd. With pip, the installation can be done like this:
sudo pip3 install xlrd
After installation, the library can be inspected via pip:
$ pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: [email protected]
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:
Let’s look at a simple example of reading Excel:
# file_operation.py
import pandas as pd
import numpy as np
df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))
This code prints the content of the Excel file as follows:
df1:
C Mon
0 D Tue
1 E Wed
2 F Thu
3 G Fri
4 A Sat
5 B Sun
Note: The code and data files for this article are available from the Github repository mentioned at the beginning of this article.
Reading a CSV file
Next, let’s look at the example of reading a CSV file.
The first CSV file contains the following contents:
$ cat test1.csv
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat
The way to read is also simple:
# file_operation.py
df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))
Let’s look at the second example, which reads as follows:
$ cat test2.csv
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat
Strictly speaking, this is not a CSV file, because its data is not comma-separated. In this case, we can read the file by specifying a delimiter, like this:
# file_operation.py
df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))
In fact, read_csv supports many parameters that adjust how the file is read, as shown in the following table:
parameter | description |
---|---|
path | The file path |
sep or delimiter | Field separator |
header | Row number to use for the column names; default is 0 (the first row) |
index_col | Column number or name to use as the row index of the result |
names | List of column names for the result |
skiprows | Number of rows to skip from the start of the file |
na_values | Sequence of values to replace with NA |
comment | Character marking the rest of a line as a comment |
parse_dates | Try to parse the data as datetime; default is False |
keep_date_col | If columns are joined to parse a date, keep the joined columns; default is False |
converters | Column converters |
dayfirst | When parsing potentially ambiguous dates, treat them as day-first; default is False |
date_parser | Function used to parse dates |
nrows | Number of rows to read from the file |
iterator | Return a TextParser object for reading the file piece by piece |
chunksize | Size of the chunks to read |
skip_footer | Number of lines at the end of the file to ignore |
verbose | Print various parsing information |
encoding | File encoding |
squeeze | If the parsed data contains only one column, return a Series |
thousands | Thousands separator |
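A quick sketch exercising a few of these parameters; it reads from an in-memory buffer instead of a real file so that the example is self-contained (the data and column names are made up):

```python
from io import StringIO

import pandas as pd

data = "# music data\nC|Mon\nD|Tue\nE|Wed\n"
df = pd.read_csv(StringIO(data),
                 sep="|",                    # field separator
                 comment="#",                # skip the comment line
                 header=None,                # the data has no header row
                 names=["note", "weekday"],  # so supply column names
                 nrows=2)                    # read only the first 2 data rows
print(df)
```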
For a detailed description of the read_csv function, see the documentation: pandas.read_csv
Handling invalid values
The real world is not perfect, and we often read data with invalid values. If these invalid values are not handled properly, they can cause a lot of interference in the program.
There are two main ways to deal with invalid values: ignore them directly, or replace them with valid values.
I’ll start by creating a data structure that contains invalid values, and then use the pandas.isna function to determine which values are invalid:
# process_na.py
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
    [5.0, np.nan, np.nan, 8.0],
    [9.0, np.nan, np.nan, 12.0],
    [13.0, np.nan, 15.0, 16.0]])
print("df:\n{}\n".format(df))
print("df:\n{}\n".format(pd.isna(df)))
This code is printed as follows:
df:
      0   1     2     3
0   1.0 NaN   3.0   4.0
1   5.0 NaN   NaN   8.0
2   9.0 NaN   NaN  12.0
3  13.0 NaN  15.0  16.0

df:
       0     1      2      3
0  False  True  False  False
1  False  True   True  False
2  False  True   True  False
3  False  True  False  False
Ignoring invalid values
We can discard invalid values with the pandas.DataFrame.dropna function:
# process_na.py
print("df.dropna():\n{}\n".format(df.dropna()));
Note: by default, dropna does not modify the original data structure; it returns a new one. To modify the data in place, pass inplace=True when calling the function.
Since every row of our data contains at least one invalid value, nothing remains once they are all discarded, so this line of code prints:
df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
We can also choose to discard entire columns, here only those in which every value is invalid:
# process_na.py
print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')));
Note: axis=1 means the operation applies along the column axis. The how argument can be 'any' or 'all'; 'any' is the default.
This line of code is printed as follows:
df.dropna(axis=1, how='all'):
      0     2     3
0   1.0   3.0   4.0
1   5.0   NaN   8.0
2   9.0   NaN  12.0
3  13.0  15.0  16.0
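The difference between how='any' and how='all' shows up with a column that is only partly invalid. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0],
                   "b": [np.nan, np.nan],   # entirely invalid
                   "c": [3.0, np.nan]})     # partly invalid

print(df.dropna(axis=1, how="all"))  # drops only column b
print(df.dropna(axis=1, how="any"))  # drops columns b and c
```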
Replacing invalid values
We can also replace invalid values with valid ones using fillna, like this:
# process_na.py
print("df.fillna(1):\n{}\n".format(df.fillna(1)));
Copy the code
This code is printed as follows:
df.fillna(1):
      0    1     2     3
0   1.0  1.0   3.0   4.0
1   5.0  1.0   1.0   8.0
2   9.0  1.0   1.0  12.0
3  13.0  1.0  15.0  16.0
It may not make sense to replace every invalid value with the same data, so we can specify different fill values for different columns. To make this easier to follow, we first rename the rows and columns using the rename method:
# process_na.py
df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
inplace=True);
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df));
This code is printed as follows:
df:
        col1  col2  col3  col4
index1   1.0   2.0   3.0   4.0
index2   5.0   2.0   7.0   8.0
index3   9.0   2.0   7.0  12.0
index4  13.0   2.0  15.0  16.0
Handling strings
Data processing often involves manipulating strings, so let’s look at string handling in pandas.
The str attribute of a Series provides a set of functions for processing strings. Conveniently, these functions handle invalid values automatically.
Here are some examples. In the first set of data, we deliberately include strings containing spaces:
# process_string.py
import pandas as pd
s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5']);
print("s1.str.rstrip():\n{}\n".format(s1.str.rstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))
This example demonstrates stripping whitespace from strings and checking whether each string consists of digits. The output is as follows:
s1.str.rstrip():
0 1
1 2
2 3
3 4
4 5
dtype: object
s1.str.strip():
0 1
1 2
2 3
3 4
4 5
dtype: object
s1.str.isdigit():
0 False
1 False
2 False
3 True
4 True
dtype: bool
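As noted above, the str functions handle invalid values automatically: a NaN simply passes through instead of raising an error. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([' a ', np.nan, 'b'])
stripped = s.str.strip()   # NaN stays NaN; no exception is raised
print(stripped)
```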
Here are some other examples, showing conversion to uppercase and lowercase and computing string length:
# process_string.py
s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))
The code output is as follows:
s2.str.lower():
0 stairway to heaven
1 eruption
2 freebird
3 comfortably numb
4 all along the watchtower
dtype: object
s2.str.upper():
0 STAIRWAY TO HEAVEN
1 ERUPTION
2 FREEBIRD
3 COMFORTABLY NUMB
4 ALL ALONG THE WATCHTOWER
dtype: object
s2.str.len():
0 18
1 8
2 8
3 16
4 24
dtype: int64
Conclusion
This article is an introductory pandas tutorial, so we have covered only the most basic operations. Advanced features such as:
- MultiIndex/Advanced Indexing
- Merge, join, concatenate
- Computational tools
will be left for future articles, when we can study them together.
Readers can also follow the links below for more information.
Resources and recommended readings
- Pandas Official website
- Python for Data Analysis
- Pandas Tutorial: Data analysis with Python: Part 1