This is the third day of my participation in Gwen Challenge
After a brief introduction to NumPy, I thought about how to begin the pandas series proper. As the proverb goes, stones from other hills can polish jade, so I read through other people's tutorials. Most start with data structures and a handful of summary functions. But I don't think a series should open by reciting "dictionary" material to readers, who won't finish reading it and will forget it anyway.
False beginnings
For pandas' data structures, just remember:
- A Series is one-dimensional
- A Series has values and an index (plus a name and a data type)
- A DataFrame is two-dimensional
- It adds a column index on top of a Series (and, of course, the data becomes two-dimensional as well).
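As a quick check of the four points above, here is a minimal sketch. The labels, column names, and scores are made up for illustration and do not come from the article's data.

```python
import pandas as pd

# A Series: one-dimensional values with an index, a name, and a dtype.
# (All labels and numbers here are hypothetical.)
scores = pd.Series([90, 85, 77], index=["a", "b", "c"], name="math")
print(scores.values)              # the underlying values
print(scores.index)               # the row labels
print(scores.name, scores.dtype)  # the name and the data type

# A DataFrame: two-dimensional, with a column index on top of the row index.
table = pd.DataFrame(
    {"math": [90, 85, 77], "english": [88, 92, 70]},
    index=["a", "b", "c"],
)
print(table.columns)              # the column index
print(scores.ndim, table.ndim)    # number of dimensions of each

# Selecting one column of a DataFrame gives back a Series.
math_column = table["math"]
print(type(math_column))
```

Selecting a single column returns a Series again, which is one way to see that a DataFrame is a set of Series sharing the same row index.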
Formal beginning
God says: data analysis requires data.
So we need to read some data from a file.
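import pandas as pd
filepath = "Mysterious data.csv"
# Load the CSV into a DataFrame; the later snippets refer to it as data_table_tmp
data_table_tmp = pd.read_csv(filepath, encoding='utf-8')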
Once you have the data, you need a goal.
Otherwise, what is there to analyze?
The simulation scenario is as follows:
The school has 500 students. Two teachers, Xiao Ming and Li Xiaohua, each calculate every student's total score independently and write it into the table. If the two teachers' totals for a student agree, we treat that total as correct. If they differ, we run an SQL query by student number to fetch that student's score in each subject, and recompute the total from the resulting table.
Requirements analysis
Here we simply split the process into two steps:
- Find the student numbers for which Xiao Ming's and Li Xiaohua's totals differ.
- Group and sum the per-subject scores of the students whose totals were wrong.
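The second step can be sketched as follows. This assumes the per-subject results have already been fetched into a DataFrame. The table layout, the column names `student number`, `subject`, and `score`, and all the values are hypothetical.

```python
import pandas as pd

# Hypothetical per-subject results for the students whose totals disagreed,
# as if they had already been fetched by student number.
subject_scores = pd.DataFrame({
    "student number": [1002, 1002, 1002, 1005, 1005, 1005],
    "subject": ["math", "english", "physics", "math", "english", "physics"],
    "score": [90, 80, 85, 70, 75, 60],
})

# Group the rows by student and sum the subject scores
# to recompute each student's total.
recomputed_total = subject_scores.groupby("student number")["score"].sum()
print(recomputed_total)
```

The result is a Series indexed by student number, so it can be compared against either teacher's total column afterwards.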
Find the student numbers where Xiao Ming's and Li Xiaohua's totals differ
# Keep the rows whose index is NOT among the rows where both totals agree
data_clean = data_table_tmp[
    ~data_table_tmp.index.isin(
        data_table_tmp[
            data_table_tmp["Miss Xiao Ming's total score"] == data_table_tmp["Miss Li Siu-wah's total score"]
        ].index.to_list()
    )
]
I’m a beginner. You give me this long code to look at?
Take your time. I’m not saying this code is the answer.
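First, read it slowly from the inside out.
The innermost expression compares the two total-score columns and selects the rows of data_table_tmp where they are equal. .index.to_list() then collects the index labels of those rows. isin marks which rows of data_table_tmp have an index label in that list, and ~ inverts the mask, so only the rows where the two totals differ remain.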
The output is roughly as follows:
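To watch each layer on its own, here is a minimal sketch that runs the same steps one at a time on a small made-up table. The three rows, their values, and the `student number` column are assumptions for illustration, not the article's data.

```python
import pandas as pd

# Hypothetical three-student table; all values are made up.
data_table_tmp = pd.DataFrame({
    "student number": [1001, 1002, 1003],
    "Miss Xiao Ming's total score": [270, 255, 240],
    "Miss Li Siu-wah's total score": [270, 250, 240],
})

ming = data_table_tmp["Miss Xiao Ming's total score"]
hua = data_table_tmp["Miss Li Siu-wah's total score"]

# Step 1: boolean Series, True where the two totals agree.
equal_mask = ming == hua
print(equal_mask)

# Step 2: index labels of the rows where the totals agree.
equal_index = data_table_tmp[equal_mask].index.to_list()
print(equal_index)

# Step 3: boolean array, True for rows whose index label is in that list.
in_equal = data_table_tmp.index.isin(equal_index)
print(in_equal)

# Step 4: invert the mask to keep only the rows where the totals differ.
data_clean = data_table_tmp[~in_equal]
print(data_clean)
```

With this sample, only the student whose two totals disagree survives the final filter.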
The sharp-eyed have already spotted it: why not simply use "not equal"? Then there is no need to invert anything.
And why take a detour through the index at all? Can't the comparison be used directly as the filter? For example:
import pandas as pd
filepath = "Mysterious data.csv"
data_table_tmp = pd.read_csv(filepath, encoding='utf-8')
# Filter directly with the boolean mask: keep the rows where the totals differ
data_clean = data_table_tmp[
    data_table_tmp["Miss Xiao Ming's total score"] != data_table_tmp["Miss Li Siu-wah's total score"]
]
data_clean
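As a sanity check that the long and short versions select the same rows, here is a minimal sketch on a made-up table. The rows and values are assumptions. A real run would use the DataFrame loaded from the CSV.

```python
import pandas as pd

# Hypothetical four-student table; all values are made up.
data_table_tmp = pd.DataFrame({
    "student number": [1001, 1002, 1003, 1004],
    "Miss Xiao Ming's total score": [270, 255, 240, 199],
    "Miss Li Siu-wah's total score": [270, 250, 240, 201],
})

ming = data_table_tmp["Miss Xiao Ming's total score"]
hua = data_table_tmp["Miss Li Siu-wah's total score"]

# Long version: find the equal rows, take their index, then invert isin.
long_form = data_table_tmp[
    ~data_table_tmp.index.isin(data_table_tmp[ming == hua].index.to_list())
]

# Short version: filter directly on "not equal".
short_form = data_table_tmp[ming != hua]

# Both versions should return the same rows.
print(long_form.equals(short_form))
print(short_form)
```

The two agree here because the default index is unique. If index labels were duplicated, the index-based version would also drop every row that shares a label with a matching row, which is one more reason to prefer the direct comparison.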
In practice, data analysis code easily ends up redundant, because there are so many factors to consider along the way. So go back over your code from time to time and see whether, as the old poem has it, "suddenly I turn around, and there that person is, where the lights are dim": the simple answer may have been in front of you all along.