Before data can be used for mining and modeling, it must first be explored, and then taken through a series of preprocessing steps. Raw data typically contains incomplete, inconsistent, and abnormal records; such "dirty" data can seriously degrade the efficiency of mining and modeling and may even bias the results, so data cleaning comes first. Cleaning is followed by data integration, transformation, and normalization; together, these steps are called data preprocessing. Preprocessing improves the quality of the data on the one hand, and adapts it to the requirements of a specific mining model on the other. In practice, this part of the work may account for 70% or more of the whole project.
Chapter 6: Missing Data
In the next two chapters, we'll look at two of the more troublesome kinds of data to preprocess: missing data and text data (especially mixed-type text).
Since version 1.0, pandas has also been experimenting with new data types, notably the Nullable types and the string type. It is worth understanding these new features, which may well become mainstream in the future.
import pandas as pd
import numpy as np
df = pd.read_csv('data/table_missing.csv')
df.head()
I. Missing observations and their types
1. Understanding missing information
(a) The isna and notna methods
Applying them to a Series returns a Boolean Series:
df['Physics'].isna().head()
# True where the value is missing
df['Physics'].notna().head()
# True where the value is present
df.isna().head()
# element-wise missing indicator for the whole DataFrame
df.isna().sum()
# missing-value count per column
df.info()
# info() also reports the non-null count of each column
Using the last column as an example, select the rows where that column's value is missing:
df[df['Physics'].isna()]
all keeps a row only when every value is non-missing, while any requires at least one non-missing value. For example, keeping the fully observed rows:
df[df.notna().all(axis=1)]
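For comparison, a line added here for illustration (not in the original): with any, we keep the rows that have at least one non-missing value.
df[df.notna().any(axis=1)]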
2. The three missing-value symbols
(a) np.nan
np.nan is a troublesome thing: first of all, it is not equal to anything, not even itself.
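A quick check of this (added for illustration, using the numpy import above):
np.nan == np.nan
# False
np.nan == None
# False: it is not equal to None either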
When comparing two tables with the equals method, positions that are np.nan on both sides are automatically skipped, so missing values do not make the comparison fail:
df.equals(df)
# True: equals skips the aligned nan positions
Secondly, in a bool-typed Series np.nan is coerced to True:
pd.Series([1, np.nan, 3], dtype='bool')
Assigning np.nan into an existing bool Series changes its dtype to object:
s = pd.Series([True, False], dtype='bool')
s[1] = np.nan
s
When a table is read in, missing cells default to np.nan, whatever type the column holds.
Integer columns are therefore converted to floating point, while character columns, which cannot be converted to float, are stored as object ('O').
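A minimal illustration of this type promotion (added here; not from the original):
pd.Series([1, np.nan]).dtype
# dtype('float64'): the integer 1 has been promoted to float
pd.Series(['a', np.nan]).dtype
# dtype('O'): characters fall back to object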
(b) None
None behaves a little better: at least it is equal to itself.
None == None
# True
Unlike np.nan, None is coerced to False in a bool Series:
pd.Series([None], dtype='bool')
s = pd.Series([True, False], dtype='bool')
s[0] = None
s
type(pd.Series([1, None])[1])
# numpy.float64: passed into a numeric Series, None is automatically converted to np.nan
type(pd.Series([1, None], dtype='O')[1])
# NoneType: in an object-typed Series, None is kept as-is
pd.Series([None]).equals(pd.Series([np.nan]))
# True: in a numeric Series, None has already become np.nan, so equals sees them as the same
(c) pd.NaT
NaT is the missing value for time series. It is built into pandas and behaves exactly like a datetime version of np.nan: it is not equal to itself, and equals skips it.
s_time = pd.Series([pd.Timestamp('20120101')] * 5)
s_time
s_time[2] = None
s_time
s_time[2] = np.nan
s_time
s_time[2] = pd.NaT
s_time
type(s_time[2])
# NaTType: all three assignments above are stored as NaT
s_time[2] == s_time[2]
# False: NaT is not equal to itself
s_time.equals(s_time)
# True: equals skips the NaT position
s = pd.Series([True, False], dtype='bool')
s[1] = pd.NaT
s
# assigning NaT likewise changes the bool Series to object
3. Nullable types and the NA symbol
This is a major change introduced in pandas 1.0, meant to resolve the confusion above and unify the handling of missing values.
"The goal of pd.NA is to provide a 'missing' indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type)." (pandas 1.0 User Guide)
Users are encouraged to use the new data types and the new missing-value indicator pd.NA.
(a) Nullable integer
This type differs from the original integer dtype only in the capitalized initial: 'Int64' instead of 'int64'.
s_original = pd.Series([1, 2], dtype="int64")
s_original
s_new = pd.Series([1, 2], dtype="Int64")
s_new
s_original[1] = np.nan
s_original
# the int64 Series has been silently cast to float64
s_new[1] = np.nan
s_new
# the Int64 Series keeps its dtype and shows <NA>
s_new[1] = None
s_new
s_new[1] = pd.NaT
s_new
# None and pd.NaT are likewise stored as <NA>
(b) Nullable boolean
This type behaves much like the nullable integer above; its dtype is written 'boolean' rather than 'bool'.
s_original = pd.Series([1, 0], dtype="bool")
s_original
s_new = pd.Series([0, 1], dtype="boolean")
s_new
s_original[0] = np.nan
s_original
# re-create the Series here, because the assignment above changed the bool dtype
s_original = pd.Series([1, 0], dtype="bool")
s_original[0] = None
s_original
s_new[0] = np.nan
s_new
s_new[0] = None
s_new
s_new[0] = pd.NaT
s_new
# np.nan, None and pd.NaT are all stored as <NA> in the 'boolean' Series
Note that using a Nullable boolean Series containing <NA> as an indexer can behave unexpectedly; this looked like a bug at the time of writing:
s = pd.Series(['dog', 'cat'])
s[s_new]
(c) string
This type was a major innovation of 1.0, intended in part to separate strings from the otherwise ambiguous object dtype. It is only mentioned briefly here, since it is the subject of Chapter 7.
It is essentially also a Nullable type: containing missing values does not change its dtype.
s = pd.Series(['dog', 'cat'], dtype='string')
s
s[0] = np.nan
s
s[0] = None
s
# both assignments are stored as <NA>, and the dtype stays string
s = pd.Series(["a".None."b"], dtype="string")
s.str.count('a')
Copy the code
s2 = pd.Series(["a", None, "b"], dtype="object")
s2.str.count("a")
# compare: the string dtype returns a Nullable integer with <NA>, while object returns float with NaN
s.str.isdigit()
s2.str.isdigit()
# the string dtype yields a Nullable boolean with <NA>, while object keeps None
4. Characteristics of NA
(a) Logical operations: you only need to ask whether the result depends on the value pd.NA would take. If it does, the result is still NA; if it does not, the result can be computed directly.
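A short sketch of this three-valued logic (added for illustration):
True | pd.NA
# True: the result is True no matter what NA stands for
False | pd.NA
# <NA>: the result would change depending on NA's value
False & pd.NA
# False: the result is False no matter what NA stands for
True & pd.NA
# <NA>: the result depends on NA's value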
(b) Arithmetic and comparison operations: just remember that, with two exceptions, every result is NA; the exceptions are pd.NA ** 0 and 1 ** pd.NA, both of which equal 1.
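The two exceptions, spelled out (added for illustration):
pd.NA ** 0
# 1: anything to the power 0 is 1
1 ** pd.NA
# 1: 1 to any power is 1
pd.NA + 1
# <NA>: every other arithmetic result is NA
pd.NA == pd.NA
# <NA>: comparisons with NA are also NA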
5. The convert_dtypes method
This method, new in 1.0, converts columns to Nullable dtypes; it is typically used right after reading in data.
pd.read_csv('data/table_missing.csv').dtypes
pd.read_csv('data/table_missing.csv').convert_dtypes().dtypes
# after convert_dtypes, columns become Int64 / string / boolean where appropriate
II. Operations and grouping with missing data
1. Rules for addition and multiplication
In sums a missing value is treated as 0; in products it is treated as 1.
s = pd.Series([2, 3, np.nan, 4])
s.sum()
# 9.0
s.prod()
# 24.0
s.cumsum()
s.cumprod()
# the cumulative methods skip the missing position but keep it in the output
s.pct_change()
2. Missing values in the groupby method
Groups whose key is a missing value are automatically ignored:
df_g = pd.DataFrame({'one': ['A', 'B', 'C', 'D', np.nan], 'two': np.random.randn(5)})
df_g
df_g.groupby('one').groups
# the np.nan key does not form a group
III. Filling and dropping
1. The fillna method
(a) Filling with a value, and forward/backward filling (equivalent to the ffill and bfill methods respectively; see the comparison after this block)
df['Physics'].fillna('missing').head()
df['Physics'].fillna(method='ffill').head()
# fill forward from the previous valid value
df['Physics'].fillna(method='backfill').head()
# fill backward from the next valid value
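For comparison, the dedicated methods mentioned above give the same results (added here for illustration):
df['Physics'].ffill().head()
df['Physics'].bfill().head()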
(b) Alignment: filling with a Series aligns on column labels
df_f = pd.DataFrame({'A': [1, 3, np.nan], 'B': [2, 4, np.nan], 'C': [3, 5, np.nan]})
df_f.fillna(df_f.mean())
df_f.fillna(df_f.mean()[['A', 'B']])
# only columns A and B are filled; C has no matching label in the filler
2. The dropna method
(a) The axis parameter
df_d = pd.DataFrame({'A': [np.nan, np.nan, np.nan], 'B': [np.nan, 3, 2], 'C': [3, 2, 1]})
df_d
df_d.dropna(axis=0)
# drops every row that contains a missing value
df_d.dropna(axis=1)
# drops every column that contains a missing value
(b) The how parameter
df_d.dropna(axis=1, how='all')
# with how='all', a column is dropped only if all of its values are missing
(c) The subset parameter
df_d.dropna(axis=0, subset=['B', 'C'])
# only columns B and C are searched for missing values
IV. Interpolation
1. Linear interpolation
(a) Index-independent linear interpolation
By default, interpolate fills missing values by linear interpolation:
s = pd.Series([1, 10, 15, -5, -2, np.nan, np.nan, 28])
s
s.interpolate()
s.interpolate().plot()
s.index = np.sort(np.random.randint(50, 300, 8))
s.interpolate()
# the values are unchanged: the default method ignores the index
s.interpolate().plot()
# the last three points are no longer collinear (if they look almost linear, rerun the random-index line above; this is due to randomness)
(b) Index-dependent linear interpolation
The index and time options of the method parameter make the interpolation depend linearly on the index; that is, missing values are interpolated as a linear function of the index:
s.interpolate(method='index').plot()
# you can see the difference from the default result
s_t = pd.Series([0, np.nan, 10],
                index=[pd.Timestamp('2012-05-01'), pd.Timestamp('2012-05-07'), pd.Timestamp('2012-06-03')])
s_t
s_t.interpolate().plot()
# default: the filled value ignores the uneven time gaps
s_t.interpolate(method='time').plot()
# method='time': the filled value respects the time gaps
2. Advanced interpolation
Here "advanced" means non-linear interpolation, such as spline interpolation, polynomial interpolation, and Akima interpolation (SciPy is required); see the pandas documentation for the details of each method.
Only one official example is given for this section, since interpolation methods belong to numerical analysis rather than pandas basics:
import pandas as pd
import numpy as np
ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))
missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
ser[missing] = np.nan
methods = ['linear', 'quadratic', 'cubic']
df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
df.plot()
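The non-linear methods listed earlier can be tried the same way, for example (assuming SciPy is installed; added for illustration):
ser.interpolate(method='akima')
ser.interpolate(method='spline', order=3)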
3. Constraint parameters of interpolate
(a) limit sets the maximum number of consecutive missing values to fill
s = pd.Series([1, np.nan, np.nan, np.nan, 5])
s.interpolate(limit=2)
# only the first two of the three consecutive missing values are filled
(b) limit_direction sets the direction of filling; the default is 'forward'
s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
s.interpolate(limit_direction='backward')
# with 'backward', the leading missing values are filled but the trailing ones are not
Code and data address: github.com/XiangLinPro…
About Datawhale
Datawhale is an open-source organization focused on data science and AI. It brings together excellent learners from universities and well-known companies across many fields, gathering a group of members with an open-source and exploratory spirit. With the vision of "for the learner, growing with learners", Datawhale encourages genuine self-expression, openness and inclusiveness, mutual trust and support, and the courage to try, fail, and take responsibility. Guided by the open-source idea, Datawhale explores open-source content, open-source learning, and open-source solutions, empowering talent development and building connections between people and people, people and knowledge, people and enterprises, and people and the future.
2020.5.22