We access to the data (especially when data volume is larger), is likely to appear such problems as lack of data, abnormal data, the data processing is in the data analysis is very important and must be a link, in order to do data analysis as far as possible, reduce the occurrence of abnormal, and more accurate analysis conclusion, so before doing data analysis, data processing is particularly necessary
Platform to use: Jupyter Notebook
Missing value handling
Missing value judgment
Python reads either CSV or Excel data. If the excel cell is empty, NaN is displayed for pandas
Method to determine missing value data: ISNULL, notnull
- Isnull: True indicates missing, False indicates non-missing
- Notnull: True indicates not missing, False indicates missing
Start by importing the Python data analysis required toolkit
import numpy as np
import pandas as pd
__author__='don't let'Copy the code
Generate a tabular two-dimensional array df
Create a tabular two-dimensional array
df=pd.DataFrame({'a': [34.6.20,np.nan,56].'b': ['juejin'.'number'.'one'.'good',np.nan]})Copy the code
The output is as follows:
Determine whether missing values exist in the data and filter non-missing values:
Check whether the two-dimensional array df is missing
print(df.notnull(),'\n')
Check if column A is missing by index
print(df['a'].notnull(),'\n')
Filter column A without missing value array
print(df[df['a'].notnull()])Copy the code
The output is as follows:
a b
0 True True
1 True True
2 True True
3 False True
4 True False
0 True
1 True
2 True
3 False
4 True
Name: a, dtype: bool
a b
0 34.0 juejin
1 6.0 number
2 20.0 one
4 56.0 NaNCopy the code
Deletion of missing values
Filtering by notnull Boolean sequence values above is also a way to remove missing values
The deletion of the missing data needs to be handled according to the actual data situation and service situation. Sometimes all the missing data needs to be deleted, sometimes part of the missing data needs to be deleted, and sometimes only the specified missing data needs to be deleted.
Drop missing values: Dropna (Axis)
- The default parameter axis= 0 is used to delete row data, and when axis=1 is used to delete column data (but axis=1 is not usually selected; if it is 1, an entire variable is deleted directly).
- Passing thRESH =n preserves rows with at least n non-nan data
Create a tabular two-dimensional array
df2=pd.DataFrame([[1.2.3], ['juejin',np.nan,np.nan],['a'.'b',np.nan],[np.nan,np.nan,np.nan],['d'.'j'.'h']],
columns=list('ABC'))
print(df2,'\n')
Delete all rows with missing values
print(df2.dropna(),'\n')
Delete some rows with missing values and keep at least n non-nan rows (e.g., keep at least one non-nan row)
print(df2.dropna(thresh=1),'\n')
Delete all rows from A column with missing values, same as above Boolean sequence filter
print(df2[df2['A'].notnull()])Copy the code
The output is as follows:
A B C
0 1 2 3
1 juejin NaN NaN
2 a b NaN
3 NaN NaN NaN
4 d j h
A B C
0 1 2 3
4 d j h
A B C
0 1 2 3
1 juejin NaN NaN
2 a b NaN
4 d j h
A B C
0 1 2 3
1 juejin NaN NaN
2 a b NaN
4 d j hCopy the code
Missing values fill/replace
Fillna (value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- Value parameter: fill value
- Method parameters: pad/ffill → fill with previous data, backfill/bfill → fill with later data
Replace (to_replace=None, value=None, inplace=False, limit=None, regex=False, method=’pad’, axis=None)
- To_replacec parameter: the value to be replaced
- Value: Indicates the replacement value
The following is an example:
import copy
df3=pd.DataFrame([[1.2.3], ['juejin',np.nan,np.nan],['a'.'b',np.nan],['k',np.nan,np.nan],['d'.'j'.'h']],
columns=list('ABC'))
df4=copy.deepcopy(df3)
print(df3,'\n')
All missing values are filled with zeros
print(df3.fillna(0),'\n')
#method= 'pad' so that each missing value in column B is filled with the previous value of the missing value
df3['B'].fillna(method='pad',inplace=True)
print(df3,'\n')
# replace with
print(df4,'\n')
df4.replace(np.nan,'juejin',inplace = True)
print('Replace the missing value with juejin\n',df4)Copy the code
A B C
0 1 2 3
1 juejin NaN NaN
2 a b NaN
3 k NaN NaN
4 d j h
A B C
0 1 2 3
1 juejin 0 0
2 a b 0
3 k 0 0
4 d j h
A B C
0 1 2 3
1 juejin 2 NaN
2 a b NaN
3 k b NaN
4 d j h
A B C
0 1 2 3
1 juejin NaN NaN
2 a b NaN
3 k NaN NaN
4D j H replace the missing value with Juejin A B C0 1 2 3
1 juejin juejin juejin
2 a b juejin
3 k juejin juejin
4 d j hCopy the code
Missing value interpolation
The above mentioned missing value filling, but in the actual data processing process, missing value processing is not to randomly find all data filling, but to carry out interpolation filling for each local missing value.
Several typical missing value interpolation methods are selected here:
- Median/mode/mean interpolation
- Near value interpolation
- Lagrangian interpolation
Median/mode/mean interpolation
Generate a one-dimensional array
s1=pd.Series([6.4.2.5.4.3.3.7,np.nan,3.9,np.nan,1])
print(s1,'\n')
med=s1.median()# the median
mod=s1.mode()# the number
avg=s1.mean() The average #
print('Median, mode, mean respectively: %.2f,%.2f,%.2f'%(med,mod,avg))
# Take the average
s1.fillna(avg)Copy the code
0 6.0
1 4.0
2 2.0
3 5.0
4 4.0
5 3.0
6 3.0
7 7.0
8 NaN
9 3.0
10 9.0
11 NaN
12 1.0Dtype: float64 Median, mode, mean are:4.00.3.00.4.270 6.000000
1 4.000000
2 2.000000
3 5.000000
4 4.000000
5 3.000000
6 3.000000
7 7.000000
8 4.272727
9 3.000000
10 9.000000
11 4.272727
12 1.000000
dtype: float64Copy the code
Adjacent value interpolation
This point is actually mentioned in the missing value filling above, mainly the parameter method. You can choose to fill the data in front of the missing value or in the same position as the data in the missing value. Please refer to DF3
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Method parameters: pad/ffill → fill with previous data, backfill/bfill → fill with later data
Lagrange interpolation
In many practical problems, functions are used to express certain internal relations or laws, and many functions can only be understood through experiments and observations. For example, if a physical quantity in practice is observed and the corresponding observed value is obtained in several different places, Lagrange interpolation method can find a polynomial, which exactly takes the observed value at each observed point. Such polynomials are called Lagrange polynomials. Mathematically, Lagrange interpolation gives a polynomial function that passes through exactly a number of known points on a two-dimensional plane.
Limited to space, here is a rough explanation of the calculation process of Lagrangian interpolation
According to mathematical knowledge, for n points known on the plane, we can find an n-1 degree polynomial:
When we know the first n minus 1 coordinate points (x1, y1), (x2, y2)…… When (xn-1, yn-1), a multivariate equation can be obtained by substituting the above formula
From this multivariate equation, we can calculate the parameters A0, A1,….. The value of an-1, knowing each parameter of this multivariate equation, namely knowing a function equation between y and x, passing in the x value, can calculate the corresponding missing y value (an approximate value), similar to the above calculation process is called Lagrange interpolation.
In Python, there is a very convenient Lagrangian interpolation tool library, the specific use and implementation of the process directly in the following code example implementation
We randomly select a group of data (3,6), (7,9), (8,5), (9,8) to calculate the function equation of these points by Lagrange interpolation method, and then input the value of x that we need to interpolate, and the value of y can be obtained naturally.
# Import Lagrange interpolation calculation and plotting package
from scipy.interpolate import lagrange
import matplotlib.pyplot as plt
% matplotlib inline
Create an arbitrary two-dimensional array with missing values
s2=pd.DataFrame({'x': [3.7.12.8.9].'y': [6.9,np.nan,5.8]})
# x
x=[3.7.8.9]
# y value
y=[6.9.5.8]
# Generate a scatter diagram of these points
plt.scatter(x,y)
Figure out the equation of the function
print(lagrange(x,y))
# Select an x=12 and calculate the insert value
print('Interpolate 12 as %.2f' % lagrange(x,y)(12))Copy the code
The resulting function (the numbers 3 and 2 above represent x³ and x²), the corresponding missing value interpolation, and the scatter diagram are as follows:
3 2
0.7417 x - 14.3 x + 85.16 x - 140.8The interpolation12for103.50
Copy the code
So when x=12, the corresponding missing value can be replaced by 103.50