Pandas data analysis is simple
Hello, I’m Peter
The first edition of my Pandas data analysis series is here. You will find how to get the accompanying materials at the end of the article.
It took 103 days to get from the first Pandas article, “It All Starts with the explode Function,” on April 24, to “An Illustrated Guide to Pandas’ Axis Rotation Functions: stack and unstack,” published yesterday, August 5. Here’s a taste of Pandas:
Two lines of code are enough to compute the time difference between two dates. That is Pandas 👏
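A minimal sketch of that calculation. The year 2021 is an assumption made only for illustration, since the dates above give just the month and day; the interval does not cross February, so any year gives the same answer.

```python
import pandas as pd

# The year is assumed for illustration; the article only states April 24 and August 5.
delta = pd.Timestamp("2021-08-05") - pd.Timestamp("2021-04-24")
print(delta.days)  # 103
```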
What is Pandas
What is Pandas? Here is an explanation from Pandas’ official website:
Pandas is a core library for data analysis in Python. It provides fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. Its broader goal is to become the most powerful and flexible open source data analysis tool available in any language.
Pandas is a third-party Python library for data manipulation and analysis.
What data can Pandas handle
Pandas is a powerful data analysis library. What types of data can it handle?
- Tabular data, similar to SQL tables and Excel spreadsheets
- Ordered and unordered time series data (not necessarily fixed-frequency), commonly used in finance
- Matrix data with row and column labels, since Pandas itself is built on NumPy
What I wrote in 103 days
I published a total of 16 Pandas articles over the 103 days:
Article 1: It all starts with the explode function
This article focuses on how to use explode, a function in Pandas.
It works much like the explode function in Hive: it expands each element of a list-like value into a row of its own.
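As a small sketch of what explode does, the example below uses made-up sample data (the column names `user` and `tags` are hypothetical):

```python
import pandas as pd

# Hypothetical sample data: one row per user, each holding a list of tags.
df = pd.DataFrame({"user": ["a", "b"], "tags": [["x", "y"], ["z"]]})

# explode() gives every list element its own row and repeats the other columns.
# The original index labels are repeated as well.
exploded = df.explode("tags")
print(exploded)
```

If a fresh 0..n-1 index is preferred, `df.explode("tags", ignore_index=True)` can be used instead.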
Article 2: Series type data
There are two main data structures in Pandas, one of which is the Series.
A Series is a one-dimensional array structure consisting only of an index and values.
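A minimal sketch of a Series and its two parts; the subject names and scores are invented for illustration:

```python
import pandas as pd

# A Series pairs each value with an index label.
s = pd.Series([90, 85, 70], index=["math", "english", "art"])

print(s.index.tolist())   # the index labels
print(s.values.tolist())  # the underlying values
print(s["english"])       # look a value up by its label
```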
Article 3: Creating a DataFrame: 10 ways you can do it
This third article looks at 10 ways to create one of the most common data structures in Pandas: a DataFrame.
A DataFrame is a two-dimensional data structure that combines several Series as columns, so each column is a Series. In addition to an index and values, a DataFrame also has column labels (columns).
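One of many possible ways to build a DataFrame, sketched here by combining two Series that share an index (the names and values are hypothetical):

```python
import pandas as pd

# Two Series sharing the same index become two columns of one DataFrame.
name = pd.Series(["Tom", "Ann"], index=["r1", "r2"])
age = pd.Series([28, 31], index=["r1", "r2"])
df = pd.DataFrame({"name": name, "age": age})

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(type(df["age"]))      # each column is itself a Series
```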
After writing it, I realized I had left out one way: creating a DataFrame directly from the clipboard. Once the data has been copied to the clipboard, run the following statement to create it:
import pandas as pd
# Build a DataFrame from the tabular data currently on the clipboard
df = pd.read_clipboard()
df
Article 4: Selecting data in Pandas
The previous two articles introduced how to create the Series and DataFrame data structures, so the next step is how to get the data we want out of them.
There are so many different ways to select data in Pandas that the topic took three articles in total. The fourth article covers the first set of methods.
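A minimal sketch of three common ways to select data. These are widely used pandas idioms chosen for illustration, not necessarily the exact list from the article, and the sample data is made up:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"name": ["Tom", "Ann", "Bob"], "score": [88, 92, 75]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "score"]   # select by row label and column label
by_position = df.iloc[0, 1]       # select by integer position (row 0, column 1)
high = df[df["score"] > 80]       # filter rows with a boolean condition

print(by_label, by_position)
print(high)
```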
Article 5: Excellent! The many ways to filter data in Pandas
This is the second article on selecting data in Pandas.
Article 6: The last article on selecting data in Pandas
The final article on selecting data in Pandas focuses on three pairs of functions and the subtle differences in how they are used.
Article 7: The cornerstone of data processing: data exploration
Once the data is loaded into Pandas, and before any further processing, we need to check its basic information to get a preliminary understanding of it.
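A sketch of a typical first look at a DataFrame. The particular checks shown are common choices rather than the article's exact list, and the sample data is invented:

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"city": ["A", "B", "C"], "sales": [100.0, None, 250.0]})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.head())          # first few rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # number of missing values per column
df.info()                 # index, columns, non-null counts, memory usage
```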
Article 8: Data type operations in Pandas
When manipulating data in Pandas, it is important to make sure the data types are correct. This article looks at three common ways to convert data types, plus a way to filter columns by type:
- Cast with the astype() function
- Convert data types with a custom function
- Convert with the functions Pandas provides, such as to_numeric() and to_datetime()
- Filter columns by type with the select_dtypes() function
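The four approaches above can be sketched as follows. The sample data and the helper `to_float` are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical sample data where every column starts out as text.
df = pd.DataFrame({
    "qty": ["1", "2", "3"],
    "price": ["9.5", "n/a", "7"],
    "amount": ["$1,200", "$350", "$80"],
    "day": ["2021-04-24", "2021-08-05", "2021-08-06"],
})

# 1. Cast with astype()
df["qty"] = df["qty"].astype(int)

# 2. Convert with a custom function (hypothetical helper)
def to_float(text):
    """Strip the currency symbol and thousands separator, then convert."""
    return float(text.replace("$", "").replace(",", ""))

df["amount"] = df["amount"].apply(to_float)

# 3. Convert with the built-in helpers; errors="coerce" turns bad values into NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["day"] = pd.to_datetime(df["day"])

# 4. Keep only the numeric columns
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())
```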
Article 9: Pandas’ groupby mechanism
Grouped statistics with groupby are very common in day-to-day work and data processing projects. This article explores the inner workings of groupby.
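A small sketch of how groupby can be thought of: split the rows into groups, apply a calculation to each group, and combine the results. The store and sales data are made up:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 15, 25],
})

# Split: iterating over the groupby object shows the sub-frame for each key.
for store, group in df.groupby("store"):
    print(store, group["sales"].tolist())

# Apply and combine: one aggregated value (or several) per group.
totals = df.groupby("store")["sales"].sum()
stats = df.groupby("store")["sales"].agg(["sum", "mean"])
print(totals)
print(stats)
```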
Article 10: An illustrated guide to Pandas’ ranking mechanism
This article draws an analogy with the ranking window functions in SQL and shows how to reproduce them with the rank function in Pandas:
- row_number: sequential ranking, method='first' in the rank function
- rank: ranking with gaps, method='min'
- dense_rank: dense ranking, method='dense'
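The three mappings above, sketched on a tiny made-up score list that contains one tie:

```python
import pandas as pd

# Hypothetical scores; the two 20s are tied.
scores = pd.Series([10, 20, 20, 30])

ranked = pd.DataFrame({
    "score": scores,
    "row_number": scores.rank(method="first"),  # ties broken by order of appearance
    "rank": scores.rank(method="min"),          # ties share the lowest rank, next rank is skipped
    "dense_rank": scores.rank(method="dense"),  # ties share a rank, no gaps afterwards
})
print(ranked)
```

Note that rank returns floats by default and ranks in ascending order; pass `ascending=False` to rank from the highest value down.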
Article 11: An illustrated guide to sort_values in Pandas
Once we have ranking, sorting naturally follows. The sort_values function is used all the time in day-to-day work. For example, Top-N analysis of sales data is a common request, which means sorting the data after the grouped statistics have been computed.
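A minimal sketch of such a Top-N analysis, assuming made-up sales records and N = 2:

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "product": ["pen", "ink", "pen", "pad", "ink", "pad"],
    "amount": [5, 12, 7, 3, 8, 4],
})

# Group first, then sort the aggregated result and keep the top rows.
totals = sales.groupby("product", as_index=False)["amount"].sum()
top2 = totals.sort_values("amount", ascending=False).head(2)
print(top2)
```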
Article 12: An illustrated guide to missing values in Pandas
Real-world data is rarely perfect. It usually needs a variety of preprocessing steps, and handling missing values is one of them.
This article describes how to work with missing values in Pandas, including detecting, deleting, and filling them.
- df.isnull(), df.notnull(): these two functions are the inverse of each other
- df.isna(): equivalent to df.isnull()
- df.dropna(): deletes missing values
- df.fillna(): fills missing values
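The functions listed above, sketched on a small made-up DataFrame; the fill values are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in the second row.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

missing = df.isnull()   # True where a value is missing
present = df.notnull()  # the exact inverse of isnull()
same = df.isna().equals(missing)  # isna() gives the same result as isnull()

dropped = df.dropna()                         # drop rows containing any missing value
filled = df.fillna({"a": 0, "b": "unknown"})  # fill each column with its own value

print(missing)
print(dropped)
print(filled)
```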
Article 13: Handling duplicate values in Pandas
Duplicate values are common in data. This article introduces two functions for handling them:
- duplicated(): checks whether each row is a duplicate of an earlier one
- drop_duplicates(): deletes duplicate rows
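A short sketch of both functions on made-up data in which the third row repeats the first:

```python
import pandas as pd

# Hypothetical sample data; row 2 is an exact copy of row 0.
df = pd.DataFrame({"name": ["Tom", "Ann", "Tom"], "city": ["X", "Y", "X"]})

flags = df.duplicated()         # marks rows that repeat an earlier row
deduped = df.drop_duplicates()  # keeps the first occurrence by default

print(flags.tolist())
print(deduped)
```

Both functions accept a `subset` argument to compare only certain columns and a `keep` argument to choose which occurrence survives.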
Article 14: Challenging SQL: an illustrated guide to data merging in Pandas
In real business scenarios, our data may be spread across different databases and tables. SQL handles this with its various joins; in Pandas, the main tool is the merge function.
This article explains in detail how to use the parameters of merge:
pd.merge(left,                   # the two DataFrames to merge
         right,
         how='inner',            # 'left', 'right', 'outer', 'inner', 'cross'
         on=None,                # join key(s); defaults to the columns both frames have in common
         left_on=None,           # join keys when the key columns are named differently in each frame
         right_on=None,
         left_index=False,       # join on the index
         right_index=False,
         sort=False,             # whether to sort the result by the join keys
         suffixes=('_x', '_y'),  # suffixes added to overlapping column names
         copy=True,
         indicator=False,        # add a column showing which frame each row came from
         validate=None)
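A runnable sketch of a few of these parameters in action. The two tables and their column names are invented for illustration:

```python
import pandas as pd

# Hypothetical tables whose key columns have different names.
orders = pd.DataFrame({"user_id": [1, 2, 4], "amount": [50, 30, 20]})
users = pd.DataFrame({"id": [1, 2, 3], "name": ["Tom", "Ann", "Bob"]})

# Inner join: only keys present in both tables survive.
inner = pd.merge(orders, users, how="inner", left_on="user_id", right_on="id")

# Outer join with indicator=True: the extra _merge column records the row's source.
outer = pd.merge(
    orders, users, how="outer",
    left_on="user_id", right_on="id",
    indicator=True,
)

print(inner)
print(outer)
```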
Article 15: Data merging in Pandas: concat, join, and append
In addition to the merge function, there are three other functions that can merge data in Pandas: concat, join, and append. concat in particular is very common.
concat parameters:
pandas.concat(objs,                    # the objects to merge
              axis=0,                  # merge direction; the default 0 is vertical
              join='outer',            # how to merge: 'inner' (intersection) or 'outer' (union)
              ignore_index=False,      # whether to rebuild the index after merging
              keys=None,               # labels for the original objects, added as an outer index level; mainly used for hierarchical indexing; can be any list, array, or sequence of tuples
              levels=None,             # the levels to use for the hierarchical index, if keys is set
              names=None,              # names for the levels of the resulting hierarchical index, as a list
              verify_integrity=False,  # check for duplicate index values; raise an error if any are found
              sort=False,              # sort the non-concatenation axis
              copy=True                # whether to make a deep copy
              )
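A brief sketch of the join, ignore_index, and keys parameters, using two made-up frames that only partly share their columns:

```python
import pandas as pd

# Hypothetical quarterly data; q2 has an extra "region" column.
q1 = pd.DataFrame({"product": ["pen", "ink"], "sales": [10, 20]})
q2 = pd.DataFrame({"product": ["pad"], "sales": [5], "region": ["east"]})

# Default join='outer': union of the columns, missing cells become NaN.
stacked = pd.concat([q1, q2], ignore_index=True)

# join='inner': only the columns both frames share.
common = pd.concat([q1, q2], join="inner", ignore_index=True)

# keys: label each source frame, producing a hierarchical row index.
keyed = pd.concat([q1, q2], keys=["q1", "q2"])

print(stacked)
print(common)
print(keyed)
```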
join parameters:
DataFrame.join(other,       # the other DataFrame to join
               on=None,     # join key
               how='left',  # 'left', 'right', 'outer', 'inner'; the default is 'left'
               lsuffix='',  # suffix for overlapping column names from the left (first) DataFrame
               rsuffix='',  # suffix for overlapping column names from the right (second) DataFrame
               sort=False)  # whether to sort by the join key; the default is False
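A minimal sketch of join, which aligns on the index by default. The frames are made up, and the suffixes `_l` and `_r` are arbitrary choices needed here because both frames have a `score` column:

```python
import pandas as pd

# Hypothetical frames indexed by name, with an overlapping column name.
left = pd.DataFrame({"score": [88, 92]}, index=["Tom", "Ann"])
right = pd.DataFrame({"score": [75, 60]}, index=["Ann", "Bob"])

# Left join on the index; suffixes disambiguate the two "score" columns.
joined = left.join(right, how="left", lsuffix="_l", rsuffix="_r")
print(joined)
```

Without the suffixes, pandas raises an error here because the overlapping column names cannot be told apart.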
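Main parameters of append (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, use pd.concat instead):
DataFrame.append(
    other,                   # the object to append
    ignore_index=False,      # if True, discard the original index and renumber the rows
    verify_integrity=False,  # check for duplicate row index values; raise an error if any are found
    sort=False)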
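Because DataFrame.append no longer exists in pandas 2.0 and later, here is a sketch of the same row-appending operation written with pd.concat, which works on both old and new versions. The data is made up:

```python
import pandas as pd

# Hypothetical existing data and new rows to add.
df = pd.DataFrame({"name": ["Tom"], "score": [88]})
new_rows = pd.DataFrame({"name": ["Ann"], "score": [92]})

# Plays the role of the old df.append(new_rows, ignore_index=True).
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined)
```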
Article 16: An illustrated guide to stack and unstack in Pandas
stack and unstack are inverse operations of each other. They rotate (pivot) the axes of Pandas data.
- stack: rotates the columns into the row index
- unstack: rotates the row index into columns
- Both operate on the innermost level by default
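A small sketch of the round trip, using a made-up score table. Depending on the pandas version, calling stack with its defaults may print a FutureWarning about a newer implementation; the result for this data is the same.

```python
import pandas as pd

# Hypothetical score table: rows are students, columns are subjects.
df = pd.DataFrame(
    {"art": [70, 60], "math": [90, 80]},
    index=["Ann", "Tom"],
)

# stack: the column labels become the innermost level of the row index.
stacked = df.stack()

# unstack: the innermost row index level goes back to being columns.
restored = stacked.unstack()

print(stacked)
print(restored)
```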
Here are two images from the website to explain how they work:
What are the features of the articles
While writing, I drew on the official website, a lot of other material, and my own day-to-day experience. I also built a lot of simulated data. The articles share the following features:
- Plenty of examples: each article is illustrated with simulated data
- Illustrated: the articles use many diagrams to explain how the functions work, which is more intuitive and easier to remember
- Close to reality: much of the simulated data can be applied directly to real business scenarios
Follow-up work
What I’ve written so far is really just the tip of the iceberg for Pandas, and there’s a lot more to cover. If you actually run the code and understand it, you will get a lot out of Pandas. 🐂
I will keep publishing Pandas articles, and this will be a long process. More advanced techniques and examples will be included to help you master Pandas.
To get the materials, follow the official account “Youerhuatang” and reply with “Pandas”.