Small knowledge, big challenge! This paper is participating in theEssentials for programmers”Creative activities
One, foreword
In this article, I will explain how to use the functions merge, apply, and pivot_table for pandas, and how to use them in detail.
If you like this article, it will be broadcast live to see how it will be reproduced. There are three pandas.
Ii. Summary of this paper
- Pandas merge by Diving
- Pandas apply by Pluto
- Pandas pivot_table by graphite
First, you need to have the Python environment installed and Python development tools installed on your computer.
If you haven’t installed it yet, check out the following article:
If you only use Python to process data, crawlers, data analysis, automatic scripts, and machine learning, you are advised to use the Python basic environment and Jupyter. For details, see Windows/Mac Installation and Using The Python Environment and Jupyter Notebook
If you want to use Python to develop web projects, you are advised to use the Python foundation environment +Pycharm to install and use Python/ Pycharm.
Author: Lao Biao Link: juejin.cn/post/701147… The copyright belongs to the author. Commercial reprint please contact the author for authorization, non-commercial reprint please indicate the source.
三, Pandas merge by diving
3.1 Usage of the Merge Function
The merge() function in Pandas is similar to the join function in SQL. It merges different datasets according to certain fields (attributes) to create a new dataset.
3.2 Parameters of the Merge function
- Usage:
DataFrame1. Merge (DataFrame2, how = 'inner', on =None, left_on=None,
right_on=None, left_index=False,
right_index=False, sort=FalseThe suffixes = (' _x ', '_y))Copy the code
- Parameter Description:
-
- Inner /outer/left/right;
-
- On: joins by a field and must exist in two dateframes (left_ON and right_ON respectively if they do not exist at the same time).
-
- Left_on: left join, using the column used as the join key in DataFrame1;
-
- Right_on: right-join with the column used as the join key in DataFrame2;
-
- Left_index: Use the DataFrame1 row index as the join key;
-
- Right_index: Use DataFrame2 row index as join key;
-
- Sort: sort the merged data by join key, default True;
-
- Suffixes: add suffixes _x,_y to the new data set to distinguish duplicate columns in the two data sets.
3.3 Application of the Merge function
3.3.1 Merge Common application
import pandas as pd
Define the dataset and print it
left = pd.DataFrame({'key1': ['K0'.'K0'.'K1'.'K2'].'key2': ['K0'.'K1'.'K0'.'K1'].'A': ['A0'.'A1'.'A2'.'A3'].'B': ['B0'.'B1'.'B2'.'B3']})
right = pd.DataFrame({'key1': ['K0'.'K1'.'K1'.'K2'].'key2': ['K0'.'K0'.'K0'.'K0'].'C': ['C0'.'C1'.'C2'.'C3'].'D': ['D0'.'D1'.'D2'.'D3']})
print(left)
print('-- -- -- -- -- -- -- -- -- -- -- --')
print(right)
Copy the code
- Single field join
# merge according to key1 column and print
res = pd.merge(left, right, on='key1')
print(res)
Copy the code
- Multi-field join
['left', 'right','outer', 'inner'] ['left', 'right','outer', 'inner']
res = pd.merge(left, right, on=['key1'.'key2'], how='inner')
print(res)
res = pd.merge(left, right, on=['key1'.'key2'], how='outer')
print(res)
res = pd.merge(left, right, on=['key1'.'key2'], how='left') # merge with left
print(res)
res = pd.merge(left, right, on=['key1'.'key2'], how='right') # merge with right as main
print(res)
Copy the code
3.3.2 Merge Advanced Application
- Indicator sets merged column data sources
# indicator sets merge column data sources
df1 = pd.DataFrame({'coll': [0.1].'col_left': ['a'.'b']})
df2 = pd.DataFrame({'coll': [1.2.2].'col_right': [2.2.2]})
print(df1)
print('-- -- -- -- -- -- -- -- --')
print(df2)
# Merge according to COLL with Indicator =True and print
res = pd.merge(df1, df2, on='coll', how='outer', indicator=True)
print(res)
"Left_only indicates that the data is from the left table. Right_only indicates that the data is from the right table. Both indicates that the data is in both tables.
Copy the code
# Customize the indicator column name and print it out
res = pd.merge(df1, df2, on='coll', how='outer', indicator='indicator_column')
print(res)
Copy the code
- Merge by index
# merge by index
Define the dataset and print it out
left = pd.DataFrame({'A': ['A0'.'A1'.'A2'].'B': ['B0'.'B1'.'B2']},
index = ['K0'.'K1'.'K2'])
right = pd.DataFrame({'C': ['C0'.'C2'.'C3'].'D': ['D0'.'D2'.'D3']},
index = ['K0'.'K2'.'K3'])
print(left)
print('-- -- -- -- -- -- -- -- --')
print(right)
Copy the code
# how='outer' and print
res = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res)
# merge the index of the left and right data sets, how='inner', and print
res = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(res)
Copy the code
- We will address our Overlapping areas
Address our Overlapping areas
Define the dataset
boys = pd.DataFrame({'k': ['K0'.'K1'.'K2'].'age': [1.2.3]})
girls = pd.DataFrame({'k': ['K0'.'K1'.'K3'].'age': [4.5.6]})
print(boys)
print('-- -- -- -- -- -- -- -- --')
print(girls)
Use Suffixes to solve overlapping problems
Suffixes = 'age'; suffixes = 'age'; suffixes = 'age'
res = pd.merge(boys, girls, on='k', suffixes=['_boy'.'_girl'], how='inner')
print(res)
Copy the code
Pandas apply by Pluto
4. Pandas apply by Pluto
The apply function simplifies processing by quantifying the dataframe in the row or column directions.
The basic form of the apply function:
DataFrame.apply(func,
axis=0, broadcast=False,
raw=False, reduce=None, args=(), **kwds)
Copy the code
The first two parameters are the axis of the func operation and the axis of the axis operation, respectively. The default axis is AXIS =0, and the func operation is passed in as a sequence of columns. Setting axis=1 means that the calculation is performed by row.
In cases where the operation function is not complicated, the first argument is usually a lambda function. When a function is complex, you can write another function to call it. Here is an example:
import pandas as pd
df = pd.DataFrame({'A': [3.1.4.1.5.9.None.6].'B': [1.2.3.None.5.6.7.8]})
d = df.apply(lambda x: x.fillna(x.mean()))
print(df)
print('-- -- -- -- -- -- -- -- -- --')
print(d)
Copy the code
Data before processing:
Processed data:
As you can see, the code above populates nan values with the mean values of the nan columns by apply.
By default, the axis parameter value is 0, indicating that a specific function operation is performed on the row direction, that is, on each column.
We can evaluate each row by setting Axis =1. For example, if I set the previous example to axis=1, the result will be:
You can see that it populates the nan values with the mean of each row.
Apply can also be called by writing another function:
import pandas as pd
df = pd.DataFrame({'A': [3.1.4.1.5.9.None.6].'B': [1.2.3.None.5.6.7.8]})
def add(x) :
return x+1
d = df.apply(add, axis=1)
print(df)
print('-- -- -- -- -- -- -- -- -- --')
print(d)
Copy the code
This function increments each column by one:
Note: row direction, not operation on row.
For example, if a row has [A, b, C, D], the row direction operation refers to the calculation of a column, then B column, then C column, and finally D column, so the row direction refers to the direction of operation order only.
(Don’t overdo it, remember that Axis =0 evaluates columns and axis=1 evaluates rows.)
I like pandas. I like pandas
First of all, I would like to thank [old watch Max] push the video and document, are more essence, for beginners to point out the direction.
Today is the fourth day of learning Pandas. What impressed me most is that pandas can add, delete, search, and organize EXCEL data efficiently. I thought I was quite good at this level.
A simple Apply case is as follows:
Tang’s monk teacher and apprentice add white dragon horse a line of 5 people to attend adult exam, exam course contains language, mathematics, English in all 3, want to know now 3 course achievement all not less than what does the person of 85 minutes have?
import pandas as pd
df = pd.DataFrame({'Chinese': [93.80.85.76.58].'mathematics': [87.99.95.85.70].'English': [80.85.97.65.88]},
index=[Sun Wukong.'Pig Eight Quit'.'Sand Monk'."Tang's monk.'White Horse'])print(df)
print('-- -- -- -- -- -- -- -- -- -- -')
df1 = df.loc[df['Chinese'].apply(lambda x:85<=x<100)] \
.loc[df['English'].apply(lambda x:85<=x<100)] \
.loc[df['mathematics'].apply(lambda x:85<=x<100)]
print(df1)
Copy the code
Pandas pivot_table by graphite
In PANDAS, pivot_table and pivot functions can also be used for pivot_table. The former can be seen as an enhanced version of the latter.
The basic form of the pivot_table function:
DataFrame.pivot_table(self, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)
Copy the code
Pivot_tabel has low data format requirements and supports parameters such as AGgfunc and FillValue, making it more widely used.
The parameters of the pivot_table function are values(cell values), index(index), and columns(column names), which form the basic structure of a Pivot_table.
The agGFunc method is a bit more complicated. The default is to compute the mean (for a column of values), but you can also compute other statistics or get conversions of data types, and you can use more than one statistic method at a time.
In short, pivot_table can transform dimensions to observe data for “perspective” purposes.
Case description:
import numpy as np
import pandas as pd
df = pd.DataFrame({'brand': ['apple'.'samsung'.'huawei'.'OPPO'.Nokia.'millet'].'country': ['the United States'.'Korea'.'China'.'China'.'the United States'.'China'].'system': ['OS'.'Android'.'Harmony'.'Android'.'Android'.'Android'].'score': [94.7.92.8.96.8.89.3.88.4.91.2]})
df
Copy the code
Group by country. By default, calculate the average value of the column
df.pivot_table(index='country')
Copy the code
# Group by country, calculate the number of brands in each country in addition to the mean score (no repetition)
df.pivot_table(index='country',aggfunc={'score':np.mean,'brand':lambda x : len(x.unique())})
Copy the code
# Group by country, system as column name, score as value in table (take mean repeatedly), take corresponding data to generate a new table
df.pivot_table(index='country',columns='system',values='score')
Copy the code
# The average score of the highest and lowest scores of mobile phones in each country is counted, and the empty value is filled with zero
df.pivot_table(index='country',columns='system',values='score',aggfunc=[max.min,np.mean],fill_value=0)
Copy the code
In conclusion, I would like to extend my warm welcome to the Pandas. In the final chapter of this article, I will provide a comprehensive description of how pandas can be used for data processing and analysis in Python. I will review the description of the book for the benefit of pandas users.
For today, feel free to leave a comment in the comments about learning this article or any other tips you would like to recommend