Data analysis - Missing value processing

When we obtain data (especially in large volumes), problems such as missing values and abnormal values are likely to appear. Data processing is therefore an important and necessary step in data analysis: it reduces anomalies during the analysis and leads to more accurate conclusions, so it should be done before the analysis itself.

Platform used: Jupyter Notebook

Missing value handling

Missing value judgment

pandas can read both CSV and Excel data. If a cell is empty, pandas represents it as NaN.

Methods for detecting missing values: isnull, notnull

isnull: True indicates missing, False indicates not missing
notnull: True indicates not missing, False indicates missing
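As a minimal sketch of the behaviour described above, the snippet below reads a small CSV with empty cells and applies both methods. The CSV text and the names `csv_text` and `df_demo` are made up for illustration; they are not part of the original example.

```python
import io
import pandas as pd

# Hypothetical CSV content: bob has no score, and the last row has no name
csv_text = "name,score\nalice,90\nbob,\n,75\n"
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)            # empty cells are shown as NaN
print(df_demo.isnull())   # True marks a missing cell
print(df_demo.notnull())  # the exact inverse of isnull()
```

The two methods always return complementary Boolean tables, so either one can be used as a filter.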

Start by importing the toolkits required for data analysis in Python

import numpy as np
import pandas as pd
__author__ = "don't let"

Generate a two-dimensional table (DataFrame) df

# Create a two-dimensional table (DataFrame)
df = pd.DataFrame({'a': [34, 6, 20, np.nan, 56], 'b': ['juejin', 'number', 'one', 'good', np.nan]})


Determine whether missing values exist in the data and filter non-missing values:

# Check each cell of df: True means not missing
print(df.notnull(), '\n')
# Check column 'a' only, selected by its column label
print(df['a'].notnull(), '\n')
# Keep only the rows where column 'a' is not missing
print(df[df['a'].notnull()])

The output is as follows:

       a      b
0   True   True
1   True   True
2   True   True
3  False   True
4   True  False 

0     True
1     True
2     True
3    False
4     True
Name: a, dtype: bool 

      a       b
0  34.0  juejin
1   6.0  number
2  20.0     one
4  56.0     NaN
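Before deciding how to treat missing values, it often helps to know how many there are. The following is a small sketch, reusing the same df as above, that counts missing values per column; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [34, 6, 20, np.nan, 56],
                   'b': ['juejin', 'number', 'one', 'good', np.nan]})

# Number of missing cells in each column (True counts as 1)
missing_per_column = df.isnull().sum()
# Share of missing cells in each column
missing_ratio = df.isnull().mean()
# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column, '\n')
print(missing_ratio, '\n')
print(rows_with_missing)
```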

Deletion of missing values

Filtering with the Boolean sequence returned by notnull, as above, is also a way to remove missing values.

How missing data is deleted depends on the actual data and the business context. Sometimes all rows with missing data need to be deleted, sometimes only some of them, and sometimes only rows where a specified field is missing.

Drop missing values: dropna(axis)

The default axis=0 deletes rows, and axis=1 deletes columns (axis=1 is rarely chosen, because it removes an entire variable).
Passing thresh=n keeps only the rows with at least n non-NaN values

# Create a two-dimensional table (DataFrame)
df2 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan], [np.nan, np.nan, np.nan], ['d', 'j', 'h']],
                 columns=list('ABC'))
print(df2, '\n')
# Delete every row that contains a missing value
print(df2.dropna(), '\n')
# Keep only rows with at least n non-NaN values (here n=1)
print(df2.dropna(thresh=1), '\n')
# Delete the rows where column A is missing, same as the Boolean sequence filter above
print(df2[df2['A'].notnull()])

The output is as follows:

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3     NaN  NaN  NaN
4       d    j    h 

   A  B  C
0  1  2  3
4  d  j  h 

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
4       d    j    h 

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
4       d    j    h
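dropna also accepts `how` and `subset`, which cover the "delete only part of the missing data" cases mentioned above. A brief sketch on the same df2 follows; the result variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan],
                    [np.nan, np.nan, np.nan], ['d', 'j', 'h']],
                   columns=list('ABC'))

# how='all': drop a row only when every value in it is missing
only_empty_rows_dropped = df2.dropna(how='all')
# subset: look at column B only when deciding which rows to drop
b_present = df2.dropna(subset=['B'])
# axis=1: drop every column that contains a missing value
complete_columns = df2.dropna(axis=1)

print(only_empty_rows_dropped, '\n')
print(b_present, '\n')
print(complete_columns.shape)
```

Because every column of df2 contains at least one NaN, the axis=1 call removes all three columns, which shows why column-wise deletion is rarely the right choice.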

Missing values fill/replace

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

value parameter: the fill value
method parameter: pad/ffill → fill with the previous value, backfill/bfill → fill with the next value

replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)

to_replace parameter: the value to be replaced
value parameter: the replacement value

The following is an example:

df3 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan], ['k', np.nan, np.nan], ['d', 'j', 'h']],
                 columns=list('ABC'))
# Keep an independent copy for the replace example
df4 = df3.copy()
print(df3, '\n')
# Fill all missing values with 0
print(df3.fillna(0), '\n')
# Forward fill (the same as method='pad'): each missing value in column B takes the previous value
df3['B'] = df3['B'].ffill()
print(df3, '\n')
# Replace
print(df4, '\n')
df4.replace(np.nan, 'juejin', inplace=True)
print('Replace the missing value with juejin\n', df4)

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3       k  NaN  NaN
4       d    j    h 

        A  B  C
0       1  2  3
1  juejin  0  0
2       a  b  0
3       k  0  0
4       d  j  h 

        A  B    C
0       1  2    3
1  juejin  2  NaN
2       a  b  NaN
3       k  b  NaN
4       d  j    h 

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3       k  NaN  NaN
4       d    j    h 

Replace the missing value with juejin
        A       B       C
0       1       2       3
1  juejin  juejin  juejin
2       a       b  juejin
3       k  juejin  juejin
4       d       j       h
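Two variations are often useful in practice: filling each column with its own value, and using replace to turn placeholder values into NaN. The sketch below assumes a made-up table `raw` in which -999 and '?' stand for missing entries; those sentinel values are hypothetical and not from the original example.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan],
                    ['k', np.nan, np.nan], ['d', 'j', 'h']],
                   columns=list('ABC'))

# A dict passed to fillna gives each column its own fill value
per_column = df3.fillna({'B': 'unknown', 'C': 0})
print(per_column, '\n')

# Hypothetical raw data where -999 and '?' were used to mark missing entries
raw = pd.DataFrame({'age': [25, -999, 31], 'city': ['Paris', '?', 'Rome']})
# replace converts the placeholders to NaN so that isnull/dropna/fillna can see them
cleaned = raw.replace([-999, '?'], np.nan)
print(cleaned)
```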

Missing value interpolation

Filling missing values was covered above. In real data processing, however, missing values are usually not all filled with one arbitrary value; instead, each missing value is filled by interpolation based on the data around it.

Several typical missing value interpolation methods are covered here:

Median/mode/mean interpolation
Adjacent value interpolation
Lagrange interpolation

Median/mode/mean interpolation

# Generate a one-dimensional array (Series)
s1 = pd.Series([6, 4, 2, 5, 4, 3, 3, 7, np.nan, 3, 9, np.nan, 1])
print(s1, '\n')
med = s1.median()   # median
mod = s1.mode()[0]  # mode (mode() returns a Series, so take its first value)
avg = s1.mean()     # mean
print('Median, mode, mean respectively: %.2f,%.2f,%.2f' % (med, mod, avg))
# Fill the missing values with the mean
s1.fillna(avg)

0     6.0
1     4.0
2     2.0
3     5.0
4     4.0
5     3.0
6     3.0
7     7.0
8     NaN
9     3.0
10    9.0
11    NaN
12    1.0
dtype: float64 

Median, mode, mean respectively: 4.00,3.00,4.27
0     6.000000
1     4.000000
2     2.000000
3     5.000000
4     4.000000
5     3.000000
6     3.000000
7     7.000000
8     4.272727
9     3.000000
10    9.000000
11    4.272727
12    1.000000
dtype: float64
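The same idea extends to a table with mixed column types. The sketch below uses a made-up DataFrame `people` (not from the original example): the numeric column is filled with its median, which is less affected by the outlier 100 than the mean would be, and the text column is filled with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric column with an outlier, one text column
people = pd.DataFrame({'age': [20, 30, np.nan, 40, 100],
                       'city': ['Rome', 'Paris', 'Rome', np.nan, 'Rome']})

filled = people.copy()
# Numeric column: fill with the median of the non-missing values
filled['age'] = filled['age'].fillna(filled['age'].median())
# Text column: fill with the most frequent value
filled['city'] = filled['city'].fillna(filled['city'].mode()[0])

print(filled)
```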

Adjacent value interpolation

This was already covered in the missing value filling section above, mainly through the method parameter. You can fill a missing value with the value before it or the value after it. See the df3 example.

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Method parameters: pad/ffill → fill with previous data, backfill/bfill → fill with later data
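A short sketch of adjacent value filling on a made-up Series `s` is shown below. It uses the ffill() and bfill() methods, which do the same job as fillna with method='ffill' or method='bfill'; recent pandas releases deprecate the method argument in favour of these methods.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()                 # each NaN takes the previous valid value
backward = s.bfill()                # each NaN takes the next valid value
forward_limited = s.ffill(limit=1)  # fill at most one consecutive NaN

print(forward.tolist())
print(backward.tolist())
print(forward_limited.tolist())
```

Note that bfill leaves the last element missing, because there is no later value to copy from.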

Lagrange interpolation

In many practical problems, functions are used to express some underlying relationship or law, and many of these functions can only be learned through experiments and observations. For example, if a physical quantity is observed at several different points and the corresponding values are recorded, the Lagrange interpolation method finds a polynomial that takes exactly the observed value at each observed point. Such a polynomial is called a Lagrange polynomial. Mathematically, Lagrange interpolation gives a polynomial function that passes exactly through a number of known points on a two-dimensional plane.

Due to limited space, only a rough explanation of the calculation process of Lagrange interpolation is given here

For n known points on the plane (with distinct x values), we can find a polynomial of degree at most n-1 that passes through all of them:

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$$

Substituting the n known points (x1, y1), (x2, y2), ..., (xn, yn) into this formula gives a system of n linear equations in the n unknown coefficients:

$$y_i = a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_{n-1} x_i^{n-1}, \quad i = 1, 2, \ldots, n$$

Solving this system gives the values of the coefficients a0, a1, ..., an-1. Once the coefficients are known, we have a function relating y to x, and passing in an x value yields the corresponding missing y value (an approximation). This calculation process is called Lagrange interpolation.
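The calculation described above can be sketched directly with NumPy: substitute the known points into the polynomial and solve the resulting linear system for the coefficients. The three points below are made up for illustration (they lie on y = x²), and the helper `evaluate` is a hypothetical name.

```python
import numpy as np

# Hypothetical known points, chosen so that the answer is easy to check: y = x^2
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([1.0, 4.0, 9.0])

# Row i of V is [1, x_i, x_i^2], i.e. the polynomial terms evaluated at x_i
V = np.vander(xs, increasing=True)
# Solve V @ [a0, a1, a2] = ys for the coefficients
coeffs = np.linalg.solve(V, ys)

def evaluate(x):
    """Evaluate a0 + a1*x + a2*x^2 with the solved coefficients."""
    return sum(a * x**k for k, a in enumerate(coeffs))

print(coeffs)         # coefficients a0, a1, a2
print(evaluate(4.0))  # value of the fitted polynomial at x = 4
```

This is only one way to obtain the polynomial; library routines build the same polynomial from the Lagrange basis instead of solving the system explicitly.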

In Python, SciPy provides a convenient Lagrange interpolation function. Its use is shown directly in the code example below.

We take an arbitrary group of points (3,6), (7,9), (8,5), (9,8), compute the polynomial through these points with the Lagrange interpolation method, and then pass in the x value to be interpolated to obtain the corresponding y value.

# Import the Lagrange interpolation function and the plotting package
from scipy.interpolate import lagrange
import matplotlib.pyplot as plt
%matplotlib inline
# Create an arbitrary DataFrame with a missing value
s2 = pd.DataFrame({'x': [3, 7, 12, 8, 9], 'y': [6, 9, np.nan, 5, 8]})
# x values of the known points
x = [3, 7, 8, 9]
# y values of the known points
y = [6, 9, 5, 8]
# Draw a scatter diagram of these points
plt.scatter(x, y)
# Compute the interpolating polynomial
print(lagrange(x, y))
# Evaluate the polynomial at x=12 to get the interpolated value
print('Interpolate 12 as %.2f' % lagrange(x, y)(12))

The resulting function (the 3 and 2 on the first output line are the exponents of x, i.e. x³ and x²) and the corresponding interpolated value are as follows:

        3        2
0.7417 x - 14.3 x + 85.16 x - 140.8
Interpolate 12 as 103.50

[Figure: scatter diagram of the four known points]

So when x=12, the corresponding missing value can be replaced by 103.50
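To write the interpolated value back into the table, one possible approach is sketched below: fit the polynomial on the rows where y is known, then evaluate it at the x of the rows where y is missing. The variable names `known`, `poly` and `missing` are chosen here for illustration. The columns are converted with to_numpy() before calling lagrange so that the points are indexed by position rather than by the DataFrame's row labels.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

s2 = pd.DataFrame({'x': [3, 7, 12, 8, 9], 'y': [6, 9, np.nan, 5, 8]})

# Fit the polynomial on the rows where y is known
known = s2.dropna(subset=['y'])
poly = lagrange(known['x'].to_numpy(), known['y'].to_numpy())

# Evaluate it at the x of every row where y is missing and write the result back
missing = s2['y'].isnull()
s2.loc[missing, 'y'] = poly(s2.loc[missing, 'x'].to_numpy())

print(s2)
```

Here x=12 lies outside the range of the known x values (3 to 9), so the polynomial is being extrapolated rather than interpolated, which is why the filled value is so far from the other y values.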
