27,000 words, 103 days, 16 articles: Pandas Data Analysis

Pandas data analysis is simple

Hello, I’m Peter

The first edition of Pandas Data Analysis is here. Instructions for getting the materials are at the end of the article.

It took 103 days to get from the first Pandas article, “It All Starts with the Explode Function,” on April 24, to “An Illustrated Guide to Pandas’ Axis Rotation Functions: stack and unstack,” published yesterday, August 5. Here’s a look at Pandas:

Two lines of code give you the time difference between two dates. That is Pandas 👏
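As a minimal sketch of that date calculation: the article gives only the month and day, so the year 2021 below is an assumption, and the variable names are made up for illustration.

```python
import pandas as pd

# Assumption: the year is not stated in the article; 2021 is used for illustration.
first_article = pd.Timestamp("2021-04-24")
latest_article = pd.Timestamp("2021-08-05")

# Subtracting two Timestamps gives a Timedelta; .days is the whole-day part.
elapsed = latest_article - first_article
print(elapsed.days)
```

Because the span does not cross February, the day count is the same for any year.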

What is Pandas

What is Pandas? Here is an explanation from Pandas’ official website:

Pandas is Python’s core data analysis support library. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python. Its long-term goal is to become the most powerful and flexible open source data analysis tool available in any language.

Pandas is a third-party Python library for data manipulation and analysis.

What data can Pandas handle

Pandas is a powerful data analysis library. What types of data can it handle?

Tabular data, similar to SQL tables and Excel sheets
Ordered and unordered time series data, commonly used in finance
Matrix data with row and column labels, since Pandas itself is built on NumPy

What I wrote in 103 days

I published a total of 16 Pandas articles over the 103 days:

Article 1: It all starts with the explode function

This article focuses on how to use explode, a function in Pandas.

It works like the explode function in Hive: each element of a list-like value gets its own row, so the data on the left is turned into the data on the right.
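A small sketch of what explode does, using a made-up DataFrame (the column names `user` and `tags` are only for illustration):

```python
import pandas as pd

# Hypothetical data: each user has a list of tags in a single cell.
df = pd.DataFrame({"user": ["a", "b"], "tags": [["x", "y"], ["z"]]})

# explode gives every list element its own row; the other columns are repeated.
exploded = df.explode("tags")

# The original row labels are repeated too, so renumber them if needed.
renumbered = exploded.reset_index(drop=True)
print(renumbered)
```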

Article 2: Series type data

There are two main data structures in Pandas, one of which is Series.

A Series is a one-dimensional array structure consisting only of index and value.
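To make the index/value pairing concrete, here is a minimal sketch with invented labels and numbers:

```python
import pandas as pd

# A Series pairs each value with an index label.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

labels = s.index.tolist()   # the index part
values = s.tolist()         # the value part
picked = s["b"]             # look up a value by its label
print(labels, values, picked)
```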

Article 3: Creating a DataFrame: 10 ways you can do it

This third article looks at 10 ways to create one of the most common data structures in Pandas: a DataFrame.

A DataFrame is a two-dimensional data structure made up of several Series that share one index. Each column is a Series. In addition to index and values, a DataFrame also has columns.
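One way to see the "each column is a Series" idea is to build a DataFrame from a dict of Series. This is only a sketch; the names `price` and `stock` are invented, and it is not necessarily one of the ten ways covered in the article.

```python
import pandas as pd

# Two hypothetical Series sharing the same index labels.
price = pd.Series([1.5, 2.0], index=["apple", "pear"])
stock = pd.Series([10, 4], index=["apple", "pear"])

# The dict keys become the column names; the shared labels become the row index.
df = pd.DataFrame({"price": price, "stock": stock})

# Pulling a single column back out gives a Series again.
one_column = df["price"]
print(df)
```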

After I wrote it, I remembered that I missed one way: to create it directly from the clipboard. When we have the data ready in the clipboard, run the following statement to create it directly:

import pandas as pd
df = pd.read_clipboard()  # parse whatever tabular text is currently on the clipboard
df

Article 4: Selecting data in Pandas

The previous two articles introduced how to create the Series and DataFrame data structures, so the next step is getting the data we want out of them.

There are so many ways to select data in Pandas that it took three articles in total. The methods in the fourth article are as follows:

Article 5: Excellent! Pandas filters data in a wide variety of ways

This is the second article on selecting data in Pandas:

Article 6: The final part on selecting data in Pandas

The final article on selecting data in Pandas focuses on three pairs of functions and the subtle differences in how they are used.

Article 7: The cornerstone of data processing: data exploration

After importing data into Pandas and before any further processing, we need to check the basic information and get a preliminary understanding of the data, which generally includes the following information:
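A first look at a freshly loaded DataFrame might be sketched like this. The specific checks chosen here are my assumption of a typical routine, not the article's own list, and the data is invented.

```python
import pandas as pd

# Hypothetical data with one missing score.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [90.0, None, 75.0]})

shape = df.shape                # (rows, columns)
dtypes = df.dtypes              # data type of each column
missing = df.isnull().sum()     # number of missing values per column
summary = df.describe()         # summary statistics for the numeric columns

df.info()                       # prints index, columns, non-null counts, memory use
print(df.head())                # first rows of the table
```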

Article 8: Pandas data type operations

When manipulating data, it is important to make sure the data types in Pandas are correct. This article looks at three common ways to convert data types in Pandas and one way to filter columns by type.

Cast with the astype() function
Convert data types with custom functions
Convert with the functions Pandas provides, such as to_numeric() and to_datetime()
Filter columns by type with the select_dtypes() function
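The four approaches in the list above can be sketched on one small, made-up table. The column names and the percent-parsing lambda are illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw data where every column arrives as text.
df = pd.DataFrame({
    "qty": ["1", "2", "3"],
    "price": ["1.5", "bad", "3.0"],
    "day": ["2021-04-24", "2021-08-05", "2021-08-06"],
    "rate": ["10%", "25%", "50%"],
})

# 1. astype(): a direct cast, which fails loudly if any value cannot be converted.
df["qty"] = df["qty"].astype(int)

# 2. A custom function: strip the percent sign, then scale to a fraction.
df["rate"] = df["rate"].apply(lambda text: float(text.rstrip("%")) / 100)

# 3. Helper functions: errors="coerce" turns unparseable values into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["day"] = pd.to_datetime(df["day"])

# 4. select_dtypes(): keep only the numeric columns.
numeric_only = df.select_dtypes(include="number")
print(df.dtypes)
```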

Article 9: Pandas’ groupby mechanism

Grouped statistics with groupby are common in day-to-day work and data processing projects. This article explores the inner workings of groupby.
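groupby is usually described as split, apply, combine. The sketch below walks through those steps by hand on invented sales data and compares the result with the one-line version; it illustrates the idea and does not reproduce how Pandas implements it internally.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "city": ["A", "B", "A", "B"],
    "amount": [10, 20, 30, 40],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
groups = {city: part for city, part in sales.groupby("city")}

# Apply and combine by hand: aggregate each piece, then gather the results.
manual = pd.Series({city: part["amount"].sum() for city, part in groups.items()})

# The usual one-liner performs all three steps.
totals = sales.groupby("city")["amount"].sum()
print(totals)
```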

Article 10: An illustrated guide to Pandas’ ranking mechanism

This article draws an analogy with the ranking window functions in SQL and describes how to get the same results with the rank function in Pandas:

row_number: sequential ranking, method='first' in the rank function
rank: ranking with gaps after ties, method='min'
dense_rank: dense ranking without gaps, method='dense'
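The three mappings above can be tried on a small set of scores containing a tie. The data is made up, and ascending=False is my choice so that the highest score gets rank 1.

```python
import pandas as pd

# Hypothetical scores with one tie (two 90s).
scores = pd.Series([90, 80, 90, 70])

# row_number: tied values still receive distinct, consecutive ranks.
row_number = scores.rank(method="first", ascending=False)

# rank: tied values share the lowest rank and the next rank is skipped.
rank_with_gaps = scores.rank(method="min", ascending=False)

# dense_rank: tied values share a rank and no rank is skipped.
dense_rank = scores.rank(method="dense", ascending=False)

print(pd.DataFrame({
    "score": scores,
    "row_number": row_number,
    "rank": rank_with_gaps,
    "dense_rank": dense_rank,
}))
```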

Article 11: An illustrated guide to sort_values in Pandas

After ranking comes sorting. The sort_values function is used frequently in day-to-day work. Top-N analysis of sales data is a common requirement, so grouped statistics often need to be sorted afterwards.
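A Top-N analysis of that kind might look like the following sketch. The product names and amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "product": ["a", "b", "c", "a", "b", "d"],
    "amount": [5, 3, 7, 4, 1, 2],
})

# Group first: total amount per product.
totals = sales.groupby("product", as_index=False)["amount"].sum()

# Then sort descending and keep the first N rows (N = 2 here).
top2 = totals.sort_values("amount", ascending=False).head(2)
print(top2)
```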

Article 12: An illustrated guide to missing values in Pandas

In general, data is never perfect. It needs various kinds of preprocessing, and handling missing values is one of them.

This article describes how Pandas handles missing values, including detecting, deleting, and filling them.

df.isnull(), df.notnull(): these two functions are the inverse of each other
df.isna(): equivalent to df.isnull()
df.dropna(): deletes rows (or columns) that contain missing values
df.fillna(): fills in missing values
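The functions listed above, sketched on a tiny made-up table (the fill values are arbitrary choices):

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "x": [1.0, float("nan"), 3.0],
    "y": ["a", "b", None],
})

mask = df.isnull()        # True where a value is missing
present = df.notnull()    # the element-wise opposite of isnull()
same = mask.equals(df.isna())

dropped = df.dropna()     # by default, drops every row that has any missing value
filled = df.fillna({"x": 0.0, "y": "unknown"})  # a different fill value per column
print(filled)
```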

Article 13: Handling duplicate values in Pandas

It is common to have duplicate values in data. This article mainly introduces two functions for handling them:

duplicated(): flags whether each row is a duplicate of an earlier one
drop_duplicates(): deletes duplicate rows
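A minimal sketch of both functions on invented data:

```python
import pandas as pd

# Hypothetical data: the third row repeats the first.
df = pd.DataFrame({"k": ["a", "b", "a"], "v": [1, 2, 1]})

# duplicated() marks a row True when an identical row appeared earlier.
flags = df.duplicated()

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates()
print(deduped)
```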

Article 14: Challenging SQL: an illustrated guide to merging data in Pandas

In real business scenarios, our data may live in different databases and tables. SQL combines them with various joins; in Pandas this is done mainly with the merge function.

This article explains in detail how to use the merge parameters:

pd.merge(left,                    # the two DataFrames to merge
         right,
         how='inner',             # 'left', 'right', 'outer', 'inner', 'cross'
         on=None,                 # join key(s); defaults to the columns common to both
         left_on=None,            # use when the key columns have different names but matching values
         right_on=None,
         left_index=False,        # join on the index
         right_index=False,
         sort=False,              # whether to sort by the join keys
         suffixes=('_x', '_y'),   # suffixes for overlapping column names
         copy=True,
         indicator=False,         # add a column showing where each row came from
         validate=None)
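Here is a small sketch of a few of these parameters working together. The tables `orders` and `users` are invented; the key columns deliberately have different names so that left_on and right_on are needed.

```python
import pandas as pd

# Hypothetical tables whose key columns are named differently.
orders = pd.DataFrame({"user_id": [1, 2, 4], "amount": [100, 200, 400]})
users = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})

# A left join keeps every order; indicator=True adds a _merge column
# recording whether a match was found in the right table.
merged = pd.merge(
    orders,
    users,
    how="left",
    left_on="user_id",
    right_on="id",
    indicator=True,
)
print(merged)
```

The order with user_id 4 has no matching user, so its name is missing and its _merge value is left_only.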

Article 15: Pandas data merging: concat, join, and append

In addition to the merge function, three other functions in Pandas can combine data: concat, join, and append. concat in particular is used very often.

concat parameters:

pandas.concat(objs,                    # the objects to concatenate
              axis=0,                  # direction of concatenation; the default 0 is vertical
              join='outer',            # 'inner' (intersection) or 'outer' (union) on the other axis
              ignore_index=False,      # whether to renumber the index after concatenation
              keys=None,               # labels identifying each original object; mainly used to build a hierarchical index; can be a list, array, or tuple
              levels=None,             # the specific levels to use for the hierarchical index, if keys is set
              names=None,              # names of the index levels, as a list
              verify_integrity=False,  # check whether the new index has duplicates; raise an error if it does
              sort=False,              # whether to sort the non-concatenation axis
              copy=True                # whether to copy the data
             )
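A short sketch of the most commonly used concat parameters, on two invented frames that share only the column `x`:

```python
import pandas as pd

# Hypothetical frames with partly different columns.
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5], "z": [6]})

# Default join='outer': the union of columns, with gaps filled by NaN.
stacked = pd.concat([a, b], ignore_index=True)

# join='inner': only the columns present in every frame.
inner = pd.concat([a, b], join="inner", ignore_index=True)

# keys adds an outer index level recording which frame each row came from.
keyed = pd.concat([a, b], keys=["a", "b"])
print(keyed)
```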

join parameters:

dataframe.join(other,   # the other DataFrame to join
        on=None,        # join key
        how='left',     # 'left', 'right', 'outer', 'inner'; the default is 'left'
        lsuffix='',     # suffix for overlapping column names in the left (first) DataFrame
        rsuffix='',     # suffix for overlapping column names in the right (second) DataFrame
        sort=False)     # whether to sort by the join key; the default is False
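A minimal sketch of join on invented data. Both frames have a column named `v`, so the suffixes are required to tell them apart.

```python
import pandas as pd

# Hypothetical frames that are aligned on their index labels.
left = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"v": [10, 30]}, index=["a", "c"])

# join matches rows by index and defaults to how='left',
# so only the labels of the left frame are kept.
joined = left.join(right, lsuffix="_left", rsuffix="_right")
print(joined)
```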

Main append parameters:

DataFrame.append(
  other,                   # the object to append
  ignore_index=False,      # whether to discard the original index and renumber the rows
  verify_integrity=False,  # check whether the row index has duplicates; raise an error if it does
  sort=False)
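Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions the same result can be had with concat, as in this sketch on invented data:

```python
import pandas as pd

# Hypothetical frames: one existing table and one new row to add.
df = pd.DataFrame({"x": [1, 2]})
extra = pd.DataFrame({"x": [3]})

# Equivalent of df.append(extra, ignore_index=True) on pandas 2.x.
appended = pd.concat([df, extra], ignore_index=True)
print(appended)
```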

Article 16: An illustrated guide to stack and unstack in Pandas

stack and unstack are inverse operations. They rotate the axes of Pandas data.

stack: rotates the columns into the row index
unstack: rotates the row index into columns
Both operate on the innermost level by default
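A small round trip shows the rotation. The names and scores are invented for illustration.

```python
import pandas as pd

# Hypothetical scores: rows are students, columns are subjects.
df = pd.DataFrame(
    {"art": [70, 60], "math": [90, 80]},
    index=["ann", "bob"],
)

# stack: the column labels become the innermost level of the row index,
# giving a Series indexed by (student, subject).
stacked = df.stack()

# unstack: the innermost row index level goes back to being columns.
restored = stacked.unstack()
print(stacked)
```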

Here are two images from the official website that explain how they work:

What are the features of the articles

While writing, I referred to the official website and many other materials, and drew on my own day-to-day experience. I also simulated a lot of data. The articles share the following features:

Plenty of examples: each article is illustrated with simulated data
Illustrated: the articles use many diagrams to explain how functions work, which is more intuitive and easier to remember
Close to reality: much of the simulated data can be applied directly to real business scenarios

Follow-up work

What I’ve written so far is really just the tip of the iceberg for Pandas, and there’s a lot more to cover. If you actually run the code and understand it, you will get a lot out of Pandas. 🐂

I will keep publishing Pandas articles; this will be a long process. More advanced techniques and examples will be included to help you master Pandas.

To get the materials, follow the official account “Youerhuatang” and reply with “Pandas”.