Pandas is a Python package and one of the foundational libraries for data analysis and machine learning in Python. This article is an introduction to it.

Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data simple and intuitive. It aims to be a fundamental high-level building block for practical, real-world data analysis in Python.

Introduction

Pandas is suitable for many different types of data, including:

  • Tabular data with heterogeneously typed columns, such as SQL tables or Excel spreadsheets
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data with row and column labels (homogeneously or heterogeneously typed)
  • Any other form of observational/statistical data set

Since pandas is a Python package, you first need a Python environment on your machine. Setting one up is beyond the scope of this article, so please consult the web for instructions.

Installation

In general, we can perform the installation via pip:

sudo pip3 install pandas

Or install pandas via conda:

conda install pandas

As of this writing, the latest version of pandas is v0.22.0 (December 29, 2017).

I have posted the source code and test data for this article on GitHub: pandas_Tutorial.

Pandas is often used in conjunction with NumPy, which is also used in the source code in this article.

It is recommended that you familiarize yourself with NumPy before learning pandas; see the “NumPy” tutorial in this Python machine learning library series.

Core data structure

The Series and DataFrame data structures are at the core of pandas.

The two types of data structures are compared as follows:

Name       Dimensions  Description
Series     1-D         A labeled array of homogeneously typed data
DataFrame  2-D         A labeled, size-mutable tabular structure whose columns can hold heterogeneous types

A DataFrame can be thought of as a container for Series, that is, a DataFrame can contain several Series.

Series

Since Series is one-dimensional data, we can create this data directly from arrays, like this:

# data_structure.py

import pandas as pd
import numpy as np

series1 = pd.Series([1, 2, 3, 4])
print("series1:\n{}\n".format(series1))

This code is printed as follows:

series1:
0    1
1    2
2    3
3    4
dtype: int64

The output is explained as follows:

  • The last line of the output shows the type of the data in the Series; here all the data is of type int64.
  • The data itself appears in the second column; the first column is the index of the data, which pandas calls the Index.

We can print the data and index in a Series separately:

# data_structure.py

print("series1.values: {}\n".format(series1.values))

print("series1.index: {}\n".format(series1.index))

These two lines of code are printed as follows:

series1.values: [1 2 3 4]

series1.index: RangeIndex(start=0, stop=4, step=1)

If no index is specified (as above), the default index is of the form [0, n-1]. However, we can also specify an index when creating a Series. The index does not have to be an integer; it can be any type of data, such as a string. For example, here we map seven notes to seven letters. The purpose of the index is to look up the corresponding data, as follows:

# data_structure.py

series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
    index=["C", "D", "E", "F", "G", "A", "B"])
print("series2:\n{}\n".format(series2))
print("E is {}\n".format(series2["E"]))

This code is printed as follows:

series2:
C    1
D    2
E    3
F    4
G    5
A    6
B    7
dtype: int64

E is 3
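Note that a labeled Series can still be accessed by position as well as by label. A minimal sketch reusing the series2 definition above (the .iloc accessor is the standard way to ask for position-based access):

```python
import pandas as pd

series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
    index=["C", "D", "E", "F", "G", "A", "B"])

print(series2["E"])     # access by label
print(series2.iloc[2])  # access by position; the same element
```

Both lines print 3, since "E" is the label of the element at position 2.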

DataFrame

Let’s look at creating a DataFrame. We can create a DataFrame by creating a 4×4 matrix through NumPy’s interface, like this:

# data_structure.py

df1 = pd.DataFrame(np.arange(16).reshape(4, 4))
print("df1:\n{}\n".format(df1))

This code is printed as follows:

df1:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

As you can see from this output, the default index and column names are of the form [0, n-1].

We can specify the column name and index when creating the DataFrame, like this:

# data_structure.py

df2 = pd.DataFrame(np.arange(16).reshape(4, 4),
    columns=["column1", "column2", "column3", "column4"],
    index=["a", "b", "c", "d"])
print("df2:\n{}\n".format(df2))

This code is printed as follows:

df2:
   column1  column2  column3  column4
a        0        1        2        3
b        4        5        6        7
c        8        9       10       11
d       12       13       14       15

We can also specify the column data directly to create the DataFrame:

# data_structure.py

df3 = pd.DataFrame({"note" : ["C", "D", "E", "F", "G", "A", "B"],
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})
print("df3:\n{}\n".format(df3))

This code is printed as follows:

df3:
  note weekday
0    C     Mon
1    D     Tue
2    E     Wed
3    F     Thu
4    G     Fri
5    A     Sat
6    B     Sun

Please note:

  • Different columns of a DataFrame can have different data types
  • If you create a DataFrame from an array of Series, each Series becomes a row, not a column

Such as:

# data_structure.py

noteSeries = pd.Series(["C", "D", "E", "F", "G", "A", "B"],
    index=[1, 2, 3, 4, 5, 6, 7])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    index=[1, 2, 3, 4, 5, 6, 7])
df4 = pd.DataFrame([noteSeries, weekdaySeries])
print("df4:\n{}\n".format(df4))

The df4 output is as follows:

df4:
     1    2    3    4    5    6    7
0    C    D    E    F    G    A    B
1  Mon  Tue  Wed  Thu  Fri  Sat  Sun
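If we wanted those Series to become columns instead of rows, one simple option is to transpose the resulting DataFrame with .T. A minimal sketch, using a shortened three-element version of the data:

```python
import pandas as pd

noteSeries = pd.Series(["C", "D", "E"], index=[1, 2, 3])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed"], index=[1, 2, 3])

# Each Series becomes a row; .T swaps rows and columns
df5 = pd.DataFrame([noteSeries, weekdaySeries]).T
print(df5)
```

After the transpose, df5 has three rows (indexed 1-3) and two columns (labeled 0 and 1), one per original Series.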

We can add or remove column data to a DataFrame as follows:

# data_structure.py

df3["No."] = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("df3:\n{}\n".format(df3))

del df3["weekday"]
print("df3:\n{}\n".format(df3))

This code is printed as follows:

df3:
  note weekday  No.
0    C     Mon    1
1    D     Tue    2
2    E     Wed    3
3    F     Thu    4
4    G     Fri    5
5    A     Sat    6
6    B     Sun    7

df3:
  note  No.
0    C    1
1    D    2
2    E    3
3    F    4
4    G    5
5    A    6
6    B    7
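Assigning to a new column name, as above, always appends the column at the end. If the position matters, DataFrame.insert can place a new column at a chosen index, modifying the DataFrame in place. A small sketch with a shortened version of df3:

```python
import pandas as pd

df3 = pd.DataFrame({"note": ["C", "D", "E"]})

# insert(position, column_name, values) adds the column in place
df3.insert(0, "No.", [1, 2, 3])
print(df3)
```

The "No." column now appears first rather than last.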

Index objects and data access

The Index object contains metadata information that describes the axis. When creating a Series or DataFrame, the array or sequence of labels is converted to Index. The DataFrame column and row Index object can be obtained as follows:

# data_structure.py

print("df3.columns\n{}\n".format(df3.columns))
print("df3.index\n{}\n".format(df3.index))

These two lines of code are printed as follows:

df3.columns
Index(['note', 'No.'], dtype='object')

df3.index
RangeIndex(start=0, stop=7, step=1)

Please note:

  • An Index is not a set, so it can contain duplicate data
  • Index objects are immutable, which makes sharing them between data structures safe

DataFrame provides the following two operators to access its data:

  • loc: accesses data by row and column labels
  • iloc: accesses data by integer row and column positions

For example:

# data_structure.py

print("Note C, D is:\n{}\n".format(df3.loc[[0, 1], "note"]))
print("Note C, D is:\n{}\n".format(df3.iloc[[0, 1], 0]))

The first line accesses the elements with row labels 0 and 1 in the column labeled "note". The second line accesses the elements at row positions 0 and 1, column position 0. (For df3 the row labels and row positions happen to coincide, so both lines use 0 and 1, but they mean different things.)

These two lines of code are printed as follows:

Note C, D is:
0    C
1    D
Name: note, dtype: object

Note C, D is:
0    C
1    D
Name: note, dtype: object
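One point worth noting: label-based slices with loc include both endpoints, unlike Python's usual half-open slicing. A sketch using the df2 defined earlier:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(16).reshape(4, 4),
    columns=["column1", "column2", "column3", "column4"],
    index=["a", "b", "c", "d"])

# loc slices are inclusive of BOTH endpoints: rows "a" and "b",
# columns "column1" and "column2" are all selected
sub = df2.loc["a":"b", "column1":"column2"]
print(sub)
```

The result is a 2×2 sub-DataFrame; an iloc slice of the same shape would be df2.iloc[0:2, 0:2], with the usual exclusive upper bound.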

File operations

The library pandas provides a series of read_ functions to read files in a variety of formats, as shown below:

  • read_csv
  • read_table
  • read_fwf
  • read_clipboard
  • read_excel
  • read_hdf
  • read_html
  • read_json
  • read_msgpack
  • read_pickle
  • read_sas
  • read_sql
  • read_stata
  • read_feather

Reading Excel files

Note: To read Excel files, you need to install another library:
xlrd

With pip, the installation can be done like this:

sudo pip3 install xlrd

After installation, you can inspect the library via pip:

$ pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: [email protected]
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:

Let’s look at a simple example of reading Excel:

# file_operation.py

import pandas as pd
import numpy as np

df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))

The output is as follows (note that the first row of the spreadsheet has been used as the column names):

df1:
   C  Mon
0  D  Tue
1  E  Wed
2  F  Thu
3  G  Fri
4  A  Sat
5  B  Sun

Note: The code and data files for this article are available from the Github repository mentioned at the beginning of this article.

Reading a CSV file

Next, let’s look at the example of reading a CSV file.

The first CSV file contains the following contents:

$ cat test1.csv 
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat

The way to read is also simple:

# file_operation.py

df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))
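Note that test1.csv has no header row, so by default pandas will treat its first line (C,Mon) as the column names. Passing header=None, optionally together with names, keeps that line as data. A sketch using io.StringIO to stand in for the file (the data here mirrors the first few rows of test1.csv):

```python
import io
import pandas as pd

data = "C,Mon\nD,Tue\nE,Wed\n"

# header=None: the first row stays data; names supplies column labels
df_no_header = pd.read_csv(io.StringIO(data),
                           header=None, names=["note", "weekday"])
print(df_no_header)
```

All three rows, including "C,Mon", now appear as data under the columns "note" and "weekday".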

Let’s look at the second example, which reads as follows:

$ cat test2.csv 
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat

Strictly speaking, this is not a CSV file, because its data is not comma-separated. In this case, we can read the file by specifying a delimiter, like this:

# file_operation.py

df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))

In fact, read_csv supports a number of parameters to adjust the reading behavior, as shown in the following table:

Parameter         Description
path              File path
sep or delimiter  Field separator
header            Row number to use for the column names; default is 0 (the first row)
index_col         Column number or name to use as the row index in the result
names             List of column names for the result
skiprows          Number of rows to skip from the start of the file
na_values         Sequence of values to replace with NA
comment           Character marking end-of-line comments
parse_dates       Try to parse the data as datetime; default is False
keep_date_col     If columns are joined to parse a date, keep the joined columns; default is False
converters        Column converters
dayfirst          Treat ambiguous dates as day-first when parsing; default is False
date_parser       Function used to parse dates
nrows             Number of rows to read from the file
iterator          Return a TextParser object for reading the file piecemeal
chunksize         Size of the chunks to read at a time
skip_footer       Number of lines at the end of the file to ignore
verbose           Print various parser output information
encoding          File encoding
squeeze           If the parsed data contains only one column, return a Series
thousands         Thousands separator

For a detailed description of the read_csv function, see here: pandas.read_csv
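To illustrate a couple of these parameters, here is a small sketch combining skiprows and na_values; the sample data is made up for the example, and io.StringIO stands in for a file on disk:

```python
import io
import pandas as pd

data = "# generated 2017-12-29\nnote,score\nC,90\nD,N/A\n"

df_csv = pd.read_csv(io.StringIO(data),
                     skiprows=1,         # skip the leading comment line
                     na_values=["N/A"])  # treat "N/A" as a missing value
print(df_csv)
```

After skipping the first line, "note,score" becomes the header, and the "N/A" entry is parsed as NaN rather than as the literal string.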

Handling invalid values

The real world is not perfect, and we often read data with invalid values. If these invalid values are not handled properly, they can cause a lot of interference in the program.

There are two main ways to deal with invalid values: discard them directly, or replace them with valid values.

I'll start by creating a data structure that contains invalid values. Then the pandas.isna function is used to determine which values are invalid:

# process_na.py

import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                   [5.0, np.nan, np.nan, 8.0],
                   [9.0, np.nan, np.nan, 12.0],
                   [13.0, np.nan, 15.0, 16.0]])
print("df:\n{}\n".format(df))
print("df:\n{}\n".format(pd.isna(df)))

This code is printed as follows:

df:
      0   1     2     3
0   1.0 NaN   3.0   4.0
1   5.0 NaN   NaN   8.0
2   9.0 NaN   NaN  12.0
3  13.0 NaN  15.0  16.0

df:
       0     1      2      3
0  False  True  False  False
1  False  True   True  False
2  False  True   True  False
3  False  True  False  False

Ignore invalid values

We can discard invalid values with the pandas.DataFrame.dropna function:

# process_na.py

print("df.dropna():\n{}\n".format(df.dropna()));

Note: dropna does not change the original data structure by default; it returns a new one. If you want to modify the data in place, pass inplace=True when calling the function.

Since every row of our DataFrame contains at least one invalid value, nothing survives when all such rows are discarded, so this line of code prints:

df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []

We can also choose to discard columns in which all the values are invalid:

# process_na.py

print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')));

Note: axis=1 means the operation applies along the column axis. how can be 'any' or 'all'; the former is the default.

This line of code is printed as follows:

df.dropna(axis=1, how='all'):
      0     2     3
0   1.0   3.0   4.0
1   5.0   NaN   8.0
2   9.0   NaN  12.0
3  13.0  15.0  16.0
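dropna also accepts a thresh parameter, which keeps only the rows (or columns) that have at least a given number of valid values; a small sketch on a shortened version of our data:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],    # 3 valid values
                         [5.0, np.nan, np.nan, 8.0]]) # 2 valid values

# keep only rows with at least 3 non-NaN values
kept = df_small.dropna(thresh=3)
print(kept)
```

Only the first row survives, since the second has just two valid values.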

Replace invalid values

We can also replace invalid values with valid ones using fillna. Like this:

# process_na.py

print("df.fillna(1):\n{}\n".format(df.fillna(1)));

This code is printed as follows:

df.fillna(1):
      0    1     2     3
0   1.0  1.0   3.0   4.0
1   5.0  1.0   1.0   8.0
2   9.0  1.0   1.0  12.0
3  13.0  1.0  15.0  16.0

It may not make sense to replace all invalid values with the same data, so we can specify different data to populate. For ease of operation, we can rename the row and column names using the rename method before populating:

# process_na.py

df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
          columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
          inplace=True);
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df));

This code is printed as follows:

df:
        col1  col2  col3  col4
index1   1.0   2.0   3.0   4.0
index2   5.0   2.0   7.0   8.0
index3   9.0   2.0   7.0  12.0
index4  13.0   2.0  15.0  16.0
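Another common strategy is to fill each hole with the last valid value seen, via ffill (forward fill); a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# ffill propagates the last valid observation forward
filled = s.ffill()
print(filled)
```

Both NaN entries are replaced by 1.0, the last valid value before the gap; bfill works the same way in the opposite direction.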

Handling strings

Data processing often involves manipulating strings, so let's look at how pandas handles strings.

The str attribute of a Series provides a set of functions for processing strings. Conveniently, these functions handle invalid values automatically.

Here are some examples. In this first set of data, we deliberately included strings containing spaces:

# process_string.py

import pandas as pd

s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5']);
print("s1.str.lstrip():\n{}\n".format(s1.str.lstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))

This example shows stripping whitespace from strings and checking whether each string consists of digits. The output is as follows:

s1.str.lstrip():
0     1
1    2 
2    3 
3     4
4     5
dtype: object

s1.str.strip():
0    1
1    2
2    3
3    4
4    5
dtype: object

s1.str.isdigit():
0    False
1    False
2    False
3     True
4     True
dtype: bool
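To see the automatic handling of invalid values mentioned above, here is a small sketch where one element of the Series is NaN; the .str functions propagate the NaN rather than raising an error:

```python
import numpy as np
import pandas as pd

s = pd.Series([' a ', np.nan, ' b'])

# .str.strip() skips the invalid element and leaves it as NaN
stripped = s.str.strip()
print(stripped)
```

The valid strings are stripped as usual, while the NaN passes through untouched.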

Here are some more examples, covering conversion to lowercase and uppercase, and string length:

# process_string.py

s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
                    'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))

The code output is as follows:

s2.str.lower():
0          stairway to heaven
1                    eruption
2                    freebird
3            comfortably numb
4    all along the watchtower
dtype: object

s2.str.upper():
0          STAIRWAY TO HEAVEN
1                    ERUPTION
2                    FREEBIRD
3            COMFORTABLY NUMB
4    ALL ALONG THE WATCHTOWER
dtype: object

s2.str.len():
0    18
1     8
2     8
3    16
4    24
dtype: int64

Conclusion

This article is an introductory tutorial for pandas, so we have covered only the most basic operations. Topics such as

  • MultiIndex/Advanced Indexing
  • Merge, join, concatenate
  • Computational tools

and other advanced features will have to wait for a future article, when we can study them together.

Readers can also follow the links below for more information.

Resources and recommended readings

  • Pandas Official website
  • Python for Data Analysis
  • Pandas Tutorial: Data analysis with Python: Part 1